Re: [PATCH] characters invalid for an encoding

Daniel Rall Thu, 05 May 2005 14:49:07 -0700

On Thu, 2005-05-05 at 15:24 +0100, John Wilson wrote:
>I'm not sure I follow this either :)
>
>Currently we emit an XML declaration which says we are using  
>ISO8859-1 encoding.


The declaration generated depends upon the encoding in use by XmlWriter,
no?

        write(PROLOG_START);
        write(canonicalizeEncoding(enc));
        write(PROLOG_END);

>Unicode code points in the range 0X00 to 0XFF  
>have the same value as the ISO8859-1 character values. If we wish to  
>send Unicode code points with values > 0XFF then we have to emit  
>character references (e.g. &#x1FF;)
>
>If we were to change the encoding to UTF-8 or UTF-16 then we would  
>never have to emit character references (though we still could if we  
>wanted to).

Like you say below, we'd still have to emit character references for
Unicode code points not allowed in XML documents, yes?

>The XML 1.0 spec forbids some Unicode code points from appearing in a  
>well formed XML document (only these code points are allowed: #x9 |  
>#xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -  
>see section 2.2 of the spec.). Note that USASCII control characters  
>other than HT, CR and NL are not allowed. Using a character reference  
>doesn't make any difference <a>&#x0;</a> is not a well formed XML  
>document and should be rejected by an XML parser (MinML used not to  
>complain about this - later versions do).

What range are these control characters in (e.g. < 0x20)?
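
For reference, the characters in question are the C0 controls, 0x00
through 0x1F; of those, only 0x9, 0xA and 0xD appear in the Char
production quoted above. Below is a minimal sketch of that production as
a range check. It takes an int code point rather than a char so that the
supplementary range is reachable. The class and method names are made up
for illustration and are not part of XmlWriter.

```java
final class XmlChars {
    private XmlChars() {}

    /**
     * Returns true if the code point matches the Char production in
     * section 2.2 of the XML 1.0 spec. All range bounds are inclusive.
     */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // List the control characters below 0x20 that XML 1.0 allows.
        for (int cp = 0; cp < 0x20; cp++) {
            if (isValidXmlCodePoint(cp)) {
                System.out.println("allowed: 0x" + Integer.toHexString(cp));
            }
        }
    }
}
```

Taking an int matters because a Java char is a single 16-bit UTF-16
code unit and can never hold a value of 0x10000 or above.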

>There is another little wrinkle with CR and LF. An XML parser is  
>required to "normalise" line endings (see section 2.11 of the spec).  
>This normalisation involves replacing CR NL or CR with NL. This  
>normalisation does not occur if the CR is represented using a  
>character reference.
>
>So a correct XML writer should do the following:
>
>1/ refuse to write characters with Unicode code points which are not  
>allowed in an XML document.

Do you suggest throwing an exception here, or writing a '?' character?

>2/ replace characters with a Unicode code point which is not allowed  
>in the encoding being used with the appropriate character reference.

For any random encoding, anyone know a good way of determining whether
such a character is representable by said encoding?
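
One candidate, assuming we can require JDK 1.4 or later, is
java.nio.charset: CharsetEncoder.canEncode(char) reports whether a
charset can represent a given character. A rough sketch follows; the
class and method names are placeholders of mine, not existing XmlWriter
code.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class EncodingCheck {
    private EncodingCheck() {}

    /**
     * Returns true if the named encoding can represent c directly.
     * A new encoder is created per call to keep the sketch simple; a
     * real writer would probably keep one encoder per instance, since
     * CharsetEncoder is not safe for concurrent use.
     */
    static boolean isRepresentable(String encodingName, char c) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        char eAcute = '\u00e9';   // within ISO-8859-1
        char oSlash = '\u01ff';   // outside ISO-8859-1
        System.out.println(isRepresentable("ISO-8859-1", eAcute));
        System.out.println(isRepresentable("ISO-8859-1", oSlash));
        System.out.println(isRepresentable("UTF-8", oSlash));
    }
}
```

Anything for which this returns false would be written as a character
reference instead.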

>3/ replace <,& and > with either the pre defined entities (&lt; etc)  
>or with a character reference.

We're already replacing them with predefined entities, so we're in
good shape here.

>4/ replace all CR characters with a character reference.

We do this to keep them from getting normalized by the XML parser, I
take it?  Previously, we'd write them literally.
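
To check my understanding of the normalization rule, here is a small
sketch that parses the same text twice, once with a literal CR LF and
once with the CR written as a character reference. It assumes a JAXP
parser and DOM Level 3 (getTextContent) are available, as in JDK 5 and
later; the class name is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CrNormalizationDemo {
    private CrNormalizationDemo() {}

    /** Parses the XML string and returns the root element's text. */
    static String parseText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Literal CR LF: the parser normalizes the pair to a single LF.
        String literal = parseText("<a>x\r\ny</a>");
        // CR as a character reference: it survives parsing unchanged.
        String escaped = parseText("<a>x&#13;\ny</a>");
        System.out.println("literal CR kept: " + (literal.indexOf('\r') >= 0));
        System.out.println("escaped CR kept: " + (escaped.indexOf('\r') >= 0));
    }
}
```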

>If we wanted to have the greatest possible chance of interoperating  
>we should emit no XML encoding declaration and replace code points  
>with values > 0X7F with character references.

I agree with the part about replacing code points with values > 0x7f
with character references (see exchange with Jochen).

Can non-ASCII encodings be determined by the parser using the BOM, or
some such heuristic?  Would we write all non-ASCII encoding as UTF-8?
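
As far as I can tell from the spec, a parser may use a leading byte
order mark to pick the encoding family (the non-normative Appendix F
describes this), and a document with neither a BOM nor an encoding
declaration has to be UTF-8. A simplified sketch of the BOM part only,
with names I invented; it ignores the UTF-32 BOMs and the "<?xml" byte
pattern checks that Appendix F also covers.

```java
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the encoding implied by a leading byte order mark, or
     * null if none of the three BOMs handled here is present.
     */
    static String detectBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFE
                && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM recognized; the caller must decide
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<' };
        System.out.println(detectBom(utf8));
    }
}
```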


I'm attaching a patch as a discussion piece which implements some of the
discussion from this thread.

Index: XmlWriter.java
===================================================================
RCS file: /home/cvs/ws-xmlrpc/src/java/org/apache/xmlrpc/XmlWriter.java,v
retrieving revision 1.13
diff -u -u -r1.13 XmlWriter.java
--- XmlWriter.java	2 May 2005 04:22:21 -0000	1.13
+++ XmlWriter.java	5 May 2005 21:36:38 -0000
@@ -40,6 +40,8 @@
  *
  * @author <a href="mailto:[EMAIL PROTECTED]">Hannes Wallnoefer</a>
  * @author Daniel L. Rall
+ * @see <a href="http://www.xml.com/axml/testaxml.htm">Tim Bray's
+ * Annotated XML Spec</a>
  */
 class XmlWriter extends OutputStreamWriter
 {
@@ -246,6 +248,17 @@
     }
 
     /**
+     * Writes characters like '\r' (0xd) as "&amp;#13;".
+     */
+    private void writeCharacterReference(char c)
+        throws IOException
+    {
+        write("&#");
+        write(String.valueOf((int) c));
+        write(';');
+    }
+
+    /**
      *
      * @param elem
      * @throws IOException
@@ -303,10 +316,13 @@
             switch (c)
             {
             case '\t':
-            case '\r':
             case '\n':
                 write(c);
                 break;
+            case '\r':
+                // Avoid normalization of CR to LF.
+                writeCharacterReference(c);
+                break;
             case '<':
                 write(LESS_THAN_ENTITY);
                 break;
@@ -317,38 +333,18 @@
                 write(AMPERSAND_ENTITY);
                 break;
             default:
-                if (c < 0x20 || c > 0x7f)
+                // Though the XML spec requires XML parsers to support
+                // Unicode, not all such code points are valid in XML
+                // documents.  Additionally, previous to 2003-06-30
+                // the XML-RPC spec only allowed ASCII data (in
+                // <string> elements).  For interoperability with
+                // clients rigidly conforming to the pre-2003 version
+                // of the XML-RPC spec, we entity encode characters
+                // outside of the valid range for ASCII, too.
+                if (c > 0x7f || !isValidXMLChar(c))
                 {
-                    // Though the XML-RPC spec allows any ASCII
-                    // characters except '<' and '&', the XML spec
-                    // does not allow this range of characters,
-                    // resulting in a parse error from most XML
-                    // parsers.  However, the XML spec does require
-                    // XML parsers to support UTF-8 and UTF-16.
-                    if (isUnicode)
-                    {
-                        if (c < 0x20)
-                        {
-                            // Entity escape the character.
-                            write("&#");
-                            // ### Do we really need the String conversion?
-                            write(String.valueOf((int) c));
-                            write(';');
-                        }
-                        else // c > 0x7f
-                        {
-                            // Write the character in our encoding.
-                            write(new String(String.valueOf(c).getBytes(enc)));
-                        }
-                    }
-                    else
-                    {
-                        throw new XmlRpcException(0, "Invalid character data "
-                                                  + "corresponding to XML "
-                                                  + "entity &#"
-                                                  + String.valueOf((int) c)
-                                                  + ';');
-                    }
+                    // Replace the code point with a character reference.
+                    writeCharacterReference(c);
                 }
                 else
                 {
@@ -358,6 +354,35 @@
         }
     }
 
+    /**
+     * Section 2.2 of the XML spec describes which Unicode code points
+     * are valid in XML:
+     *
+     * <blockquote><code>#x9 | #xA | #xD | [#x20-#xD7FF] |
+     * [#xE000-#xFFFD] | [#x10000-#x10FFFF]</code></blockquote>
+     *
+     * Code points outside this set must be entity encoded to be
+     * represented in XML.
+     *
+     * @param c The character to inspect.
+     * @return Whether the specified character is valid in XML.
+     */
+    private static final boolean isValidXMLChar(char c)
+    {
+        switch (c)
+        {
+        case 0x9:
+        case 0xa:  // line feed, '\n'
+        case 0xd:  // carriage return, '\r'
+            return true;
+
+        default:
+            return ( (0x20 <= c && c <= 0xd7ff) ||
+                     (0xe000 <= c && c <= 0xfffd) ||
+                     (0x10000 <= c && c <= 0x10ffff) );
+        }
+    }
+
     protected static void setTypeDecoder(TypeDecoder newTypeDecoder)
     {
         typeDecoder = newTypeDecoder;
