On Thu, 2005-05-05 at 15:24 +0100, John Wilson wrote:
>I'm not sure I follow this either :)
>
>Currently we emit an XML declaration which says we are using
>ISO8859-1 encoding.

The declaration generated depends upon the encoding in use by XmlWriter,
no?

    write(PROLOG_START);
    write(canonicalizeEncoding(enc));
    write(PROLOG_END);

>Unicode code points in the range 0X00 to 0XFF
>have the same value as the ISO8859-1 character values. If we wish to
>send Unicode code points with values > 0XFF then we have to emit
>character references (e.g. &#x1FF;)
>
>If we were to change the encoding to UTF-8 or UTF-16 then we would
>never have to emit character references (though we still could if we
>wanted to).
Like you say below, we'd still have to emit character references for
Unicode code points not allowed in XML documents, yes?
>The XML 1.0 spec forbids some Unicode code points from appearing in a
>well formed XML document (only these code points are allowed: #x9 |
>#xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -
>see section 2.2 of the spec.). Note that USASCII control characters
>other than HT, CR and NL are not allowed. Using a character reference
>doesn't make any difference <a>�</a> is not a well formed XML
>document and should be rejected by an XML parser (MinML used not to
>complain about this - later versions do).
What range are these control characters in (e.g. < 0x20)?
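
For reference, going by the production John quotes, the disallowed control characters are everything below 0x20 except 0x9, 0xA and 0xD. Here is a minimal sketch of that check over int code points rather than 16-bit chars, so the supplementary range can be tested too. The class and method names are made up for illustration and are not part of XmlWriter.

```java
// Hypothetical helper, not part of XmlWriter: tests a code point against
// the Char production in section 2.2 of the XML 1.0 spec.
final class XmlChars {
    private XmlChars() {}

    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD   // HT, LF, CR
            || (cp >= 0x20 && cp <= 0xD7FF)          // excludes the surrogate block
            || (cp >= 0xE000 && cp <= 0xFFFD)        // excludes 0xFFFE and 0xFFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);    // supplementary planes
    }
}
```

Note that all range bounds are inclusive, so 0x20 (space) itself is valid.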
>There is another little wrinkle with CR and LF. An XML parser is
>required to "normalise" line endings (see section 2.11 of the spec).
>This normalisation involves replacing CR NL or CR with NL. This
>normalisation does not occur if the CR is represented using a
>character reference.
>
>So a correct XML writer should do the following:
>
>1/ refuse to write characters with Unicode code points which are not
>allowed in an XML document.
Do you suggest throwing an exception here, or writing a '?' character?
>2/ replace characters with a Unicode code point which is not allowed
>in the encoding being used with the appropriate character reference.
For any random encoding, anyone know a good way of determining whether
such a character is representable by said encoding?
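
One possibility, assuming we can depend on java.nio.charset (JDK 1.4 and later), is to ask a CharsetEncoder directly. This is only a sketch: the helper class and method name are hypothetical, and a real writer would presumably create the encoder once rather than on every call.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of XmlWriter.
final class EncodingCheck {
    private EncodingCheck() {}

    // Returns whether the named encoding can represent the given character.
    // Charset.forName throws an unchecked exception for unknown names.
    static boolean isRepresentable(String encodingName, char c) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        return encoder.canEncode(c);
    }
}
```

Anything for which this returns false would then be written as a character reference instead of literally.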
>3/ replace <, & and > with either the predefined entities (&lt; etc.)
>or with a character reference.
We're already replacing them with predefined entities, so we're in
good shape here.
>4/ replace all CR characters with a character reference.
We do this to keep them from getting normalized by the XML parser, I
take it? Previously, we'd write them literally.
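
To convince myself of the difference, here is a small sketch using the JAXP parser bundled with the JDK. The class is a hypothetical demo, not part of the patch. It writes CR as a character reference and then reads the text back.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of XmlWriter.
final class CrRoundTrip {
    private CrRoundTrip() {}

    // Writes each CR as a character reference, as case '\r' in the patch does.
    static String escapeCarriageReturns(String text) {
        return text.replace("\r", "&#13;");
    }

    // Parses a small document and returns the root element's text content.
    static String parseText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }
}
```

A literal CR LF inside `<a>...</a>` should come back from parseText as a lone LF, per section 2.11, while the same text passed through escapeCarriageReturns first should come back with its CR intact.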
>If we wanted to have the greatest possible chance of interoperating
>we should emit no XML encoding declaration and replace code points
>with values > 0X7F with character references.
I agree with the part about replacing code points with values > 0x7f
with character references (see exchange with Jochen).
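
As a rough illustration of that rule on its own (leaving out the handling of <, & and invalid characters), the sketch below walks a string by code point rather than by char. The class and method names are hypothetical. Iterating by code point matters for characters outside the BMP: writing each half of a surrogate pair as its own reference would produce references to code points that are not allowed in XML.

```java
// Hypothetical helper, not part of XmlWriter.
final class AsciiEscaper {
    private AsciiEscaper() {}

    // Replaces every code point above 0x7F with a decimal character reference.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // combines a surrogate pair into one value
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return out.toString();
    }
}
```

For example, escapeNonAscii("caf\u00e9") yields "caf&#233;".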
Can non-ASCII encodings be determined by the parser using the BOM, or
some such heuristic? Would we write all non-ASCII encoding as UTF-8?
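
For what it's worth, my understanding of the XML 1.0 spec is that a document with neither an encoding declaration nor external encoding information has to be UTF-8 or UTF-16, and that UTF-16 must start with a BOM. The BOM part of the detection is simple enough to sketch. The class below is hypothetical, and it ignores UTF-32 and the "<?xml" byte-pattern guessing the spec also describes.

```java
// Hypothetical helper, not part of XmlWriter.
final class BomSniffer {
    private BomSniffer() {}

    // Returns the encoding implied by a leading byte order mark, or null
    // if the bytes do not start with a UTF-8 or UTF-16 BOM.
    static String detectFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

When this returns null and there is no declaration, a parser would fall back to treating the input as UTF-8, which is compatible with pure ASCII output.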
I'm attaching a patch as a discussion piece which implements some of the
ideas discussed in this thread.

Index: XmlWriter.java
===================================================================
RCS file: /home/cvs/ws-xmlrpc/src/java/org/apache/xmlrpc/XmlWriter.java,v
retrieving revision 1.13
diff -u -u -r1.13 XmlWriter.java
--- XmlWriter.java 2 May 2005 04:22:21 -0000 1.13
+++ XmlWriter.java 5 May 2005 21:36:38 -0000
@@ -40,6 +40,8 @@
*
* @author <a href="mailto:[EMAIL PROTECTED]">Hannes Wallnoefer</a>
* @author Daniel L. Rall
+ * @see <a href="http://www.xml.com/axml/testaxml.htm">Tim Bray's
+ * Annotated XML Spec</a>
*/
class XmlWriter extends OutputStreamWriter
{
@@ -246,6 +248,17 @@
}
/**
+ * Writes characters like '\r' (0xd) as "&#13;".
+ */
+ private void writeCharacterReference(char c)
+ throws IOException
+ {
+ write("&#");
+ write(String.valueOf((int) c));
+ write(';');
+ }
+
+ /**
*
* @param elem
* @throws IOException
@@ -303,10 +316,13 @@
switch (c)
{
case '\t':
- case '\r':
case '\n':
write(c);
break;
+ case '\r':
+ // Avoid normalization of CR to LF.
+ writeCharacterReference(c);
+ break;
case '<':
write(LESS_THAN_ENTITY);
break;
@@ -317,38 +333,18 @@
write(AMPERSAND_ENTITY);
break;
default:
- if (c < 0x20 || c > 0x7f)
+ // Though the XML spec requires XML parsers to support
+ // Unicode, not all such code points are valid in XML
+ // documents. Additionally, previous to 2003-06-30
+ // the XML-RPC spec only allowed ASCII data (in
+ // <string> elements). For interoperability with
+ // clients rigidly conforming to the pre-2003 version
+ // of the XML-RPC spec, we entity encode characters
+ // outside of the valid range for ASCII, too.
+ if (c > 0x7f || !isValidXMLChar(c))
{
- // Though the XML-RPC spec allows any ASCII
- // characters except '<' and '&', the XML spec
- // does not allow this range of characters,
- // resulting in a parse error from most XML
- // parsers. However, the XML spec does require
- // XML parsers to support UTF-8 and UTF-16.
- if (isUnicode)
- {
- if (c < 0x20)
- {
- // Entity escape the character.
- write("&#");
- // ### Do we really need the String conversion?
- write(String.valueOf((int) c));
- write(';');
- }
- else // c > 0x7f
- {
- // Write the character in our encoding.
- write(new String(String.valueOf(c).getBytes(enc)));
- }
- }
- else
- {
- throw new XmlRpcException(0, "Invalid character data "
- + "corresponding to XML "
- + "entity &#"
- + String.valueOf((int) c)
- + ';');
- }
+ // Replace the code point with a character reference.
+ writeCharacterReference(c);
}
else
{
@@ -358,6 +354,35 @@
}
}
+ /**
+ * Section 2.2 of the XML spec describes which Unicode code points
+ * are valid in XML:
+ *
+ * <blockquote><code>#x9 | #xA | #xD | [#x20-#xD7FF] |
+ * [#xE000-#xFFFD] | [#x10000-#x10FFFF]</code></blockquote>
+ *
+ * Code points outside this set must be entity encoded to be
+ * represented in XML.
+ *
+ * @param c The character to inspect.
+ * @return Whether the specified character is valid in XML.
+ */
+ private static final boolean isValidXMLChar(char c)
+ {
+ switch (c)
+ {
+ case 0x9:
+ case 0xa: // line feed, '\n'
+ case 0xd: // carriage return, '\r'
+ return true;
+
+ default:
+ return ( (0x20 <= c && c <= 0xd7ff) ||
+ (0xe000 <= c && c <= 0xfffd) );
+ // [#x10000-#x10FFFF] cannot occur in a single 16-bit char.
+ }
+ }
+
protected static void setTypeDecoder(TypeDecoder newTypeDecoder)
{
typeDecoder = newTypeDecoder;