[PATCH] characters invalid for an encoding

Daniel Rall Thu, 28 Apr 2005 15:51:55 -0700

On Sun, 2005-04-24 at 18:29 +0100, John Wilson wrote:
>On 24 Apr 2005, at 09:37, Christoph Theis wrote:
>
>> The original spec allowed only ASCII characters in strings. The word
>> "ASCII" was removed in 2003. XmlWriter still checks for the range
>> 0x20 ... 0xff and not for the range allowed by the XML spec. There
>> has been a lot of discussion about this over the last few years and,
>> as far as I know, Apache's (our) xmlrpc still clings to ASCII
>> characters. But I might be wrong ...
>>
>
>Apache XML-RPC uses the ISO 8859/1 encoding (it emits an XML
>declaration saying this). 8859/1 is an eight-bit encoding, so only
>Unicode code points up to 0xFF can be represented directly. Code points
>with values greater than this should be represented by character
>references (e.g. &#x1ff;). I think that XmlWriter does this. I'm sorry
>but I do not have ready access to the source code from this machine so 
>I can't check the details directly.
>
>The use of ISO 8859/1 has always been a bit of a puzzle to me. XML 
>parsers are only required to understand UTF-8 and UTF-16 so using ISO 
>8859/1 theoretically reduces interoperability. However, I do not recall 
>ever hearing of such a problem in practice. My own view is that for
>maximum interoperability only code points up to 127 should be
>represented directly; values > 127 should be represented by character
>references. The cost of doing this is that the number of octets used
>rises when non-US-ASCII characters are exchanged.
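
For illustration only, the ASCII-plus-character-references approach
John describes could look roughly like the sketch below. This is not
the XmlWriter code: the class name AsciiXmlEscaper and its escape()
method are hypothetical, and it assumes Java 5 or later (for
codePointAt and StringBuilder). It only shows the idea of writing
code points up to 127 directly and everything above that as a decimal
character reference.

```java
/**
 * Hypothetical helper (not part of Apache XML-RPC): writes code points
 * up to 127 directly and everything above as a numeric character reference.
 */
public final class AsciiXmlEscaper {
    private AsciiXmlEscaper() {}

    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so surrogate pairs become one reference.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp > 0x7f) {
                // Decimal character reference, e.g. U+00E9 becomes &#233;
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        }
        return out.toString();
    }
}
```

The output is pure ASCII, so it reads the same under ISO 8859/1,
UTF-8, or US-ASCII, at the cost of the extra octets John mentions.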


Speaking of which, I'm attaching a patch I've had kicking around for a
while, which we use in production at my day job.  John, I think you've
looked over this before, but I wanted to run it by the dev list one more
time before committing it to both the 1.2 and 2.0 branches.

Index: src/java/org/apache/xmlrpc/XmlWriter.java
===================================================================
RCS file: /home/cvs/ws-xmlrpc/src/java/org/apache/xmlrpc/XmlWriter.java,v
retrieving revision 1.6
diff -u -r1.6 XmlWriter.java
--- src/java/org/apache/xmlrpc/XmlWriter.java	21 Nov 2002 21:57:39 -0000	1.6
+++ src/java/org/apache/xmlrpc/XmlWriter.java	28 Apr 2005 22:45:07 -0000
@@ -312,6 +312,11 @@
         throws XmlRpcException, IOException
     {
         int l = text.length ();
+        String enc = super.getEncoding();
+        boolean isUnicode = UTF8.equals(enc) || "UTF-16".equals(enc);
+        // ### TODO: Use a buffer rather than going character by
+        // ### character to scale better for large text sizes.
+        //char[] buf = new char[32];
         for (int i = 0; i < l; i++)
         {
             char c = text.charAt (i);
@@ -332,16 +337,38 @@
                 write(AMPERSAND_ENTITY);
                 break;
             default:
-                if (c < 0x20 || c > 0xff)
+                if (c < 0x20 || c > 0x7f)
                 {
                     // Though the XML-RPC spec allows any ASCII
                     // characters except '<' and '&', the XML spec
                     // does not allow this range of characters,
                     // resulting in a parse error from most XML
-                    // parsers.
-                    throw new XmlRpcException(0, "Invalid character data " +
-                                              "corresponding to XML entity &#" +
-                                              String.valueOf((int) c) + ';');
+                    // parsers.  However, the XML spec does require
+                    // XML parsers to support UTF-8 and UTF-16.
+                    if (isUnicode)
+                    {
+                        if (c < 0x20)
+                        {
+                            // Entity escape the character.
+                            write("&#");
+                            // ### Do we really need the String conversion?
+                            write(String.valueOf((int) c));
+                            write(';');
+                        }
+                        else // c > 0x7f
+                        {
+                            // Write the character in our encoding.
+                            write(new String(String.valueOf(c).getBytes(enc)));
+                        }
+                    }
+                    else
+                    {
+                        throw new XmlRpcException(0, "Invalid character data "
+                                                  + "corresponding to XML "
+                                                  + "entity &#"
+                                                  + String.valueOf((int) c)
+                                                  + ';');
+                    }
                 }
                 else
                 {
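
To make the new branch easier to try outside of XmlWriter, here is a
rough standalone sketch of the same decision logic. Several things in
it are assumptions rather than the patched code: the class name
PatchedEscapeSketch is made up, the UTF8 constant is assumed to be the
string "UTF-8", IllegalArgumentException stands in for
XmlRpcException, and the character is appended as-is on the
assumption that the underlying Writer does the encoding (the patch
itself goes through getBytes(enc)). Only the '<' and '&' cases of the
switch are reproduced.

```java
/**
 * Hypothetical standalone mirror of the patched default branch.
 * Not the actual XmlWriter code.
 */
public final class PatchedEscapeSketch {
    private PatchedEscapeSketch() {}

    public static String escape(String text, String enc) {
        // Assumption: the patch's UTF8 constant holds "UTF-8".
        boolean isUnicode = "UTF-8".equals(enc) || "UTF-16".equals(enc);
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '<') {
                out.append("&lt;");
            } else if (c == '&') {
                out.append("&amp;");
            } else if (c < 0x20 || c > 0x7f) {
                if (!isUnicode) {
                    // Stand-in for the XmlRpcException thrown by the patch.
                    throw new IllegalArgumentException(
                        "Invalid character data corresponding to XML entity &#"
                        + (int) c + ';');
                }
                if (c < 0x20) {
                    // Entity escape the character, as the patch does.
                    out.append("&#").append((int) c).append(';');
                } else {
                    // c > 0x7f: leave it for the Writer to encode.
                    out.append(c);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With "UTF-8" this passes non-ASCII characters through and escapes
control characters; with any other encoding name it still throws, which
matches the behaviour of the diff for ISO 8859/1.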
