On Sun, 2005-04-24 at 18:29 +0100, John Wilson wrote:
>On 24 Apr 2005, at 09:37, Christoph Theis wrote:
>
>> The original spec allowed ASCII characters only for strings. The word
>> "ASCII"
>> was removed in 2003. XmlWriter still checks for the range 0x20 ... 0xff
>> and not for the range allowed by the XML spec. There has been a lot of
>> discussion about this over the last few years, and as far as I know
>> Apache's (our) xmlrpc still clings to ASCII characters. But I might be
>> wrong ...
>>
>
>Apache XML-RPC uses the ISO 8859/1 encoding (it emits an XML
>declaration saying this). 8859/1 is an eight bit encoding so only
>Unicode code points up to 0xFF can be represented directly. Code points
>with values greater than this should be represented by character
>references (e.g. &#511;). I think that XmlWriter does this. I'm sorry
>but I do not have ready access to the source code from this machine so
>I can't check the details directly.
>
>The use of ISO 8859/1 has always been a bit of a puzzle to me. XML
>parsers are only required to understand UTF-8 and UTF-16 so using ISO
>8859/1 theoretically reduces interoperability. However, I do not recall
>ever hearing of such a problem in practice. My own view is that for
>maximum interoperability only code points up to 127 should be
>represented directly; values > 127 should be represented by character
>references. The cost of doing this is that the number of octets used
>rises when non-US-ASCII characters are exchanged.
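
For reference, John's suggestion above can be sketched roughly as follows.
This is only an illustration: the names CharRefEscaper, escape and maxDirect
are made up for this sketch and are not part of XmlWriter or of the patch
below. It assumes that code points up to maxDirect are written as-is and
that everything above is emitted as a decimal character reference.

```java
final class CharRefEscaper {
    /**
     * Hypothetical helper (not Apache XML-RPC API): escapes text for use as
     * XML character data. Code points above maxDirect become decimal
     * character references, so the output is safe for any encoding that can
     * represent 0..maxDirect directly (127 for US-ASCII, 255 for ISO 8859/1).
     */
    static String escape(String text, int maxDirect) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Walk by code point so surrogate pairs yield one reference.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '\t' || cp == '\n' || cp == '\r') {
                // The only control characters XML 1.0 permits.
                out.append((char) cp);
            } else if (cp < 0x20) {
                // XML 1.0 forbids these even as character references.
                throw new IllegalArgumentException(
                    "Character not allowed in XML 1.0: &#" + cp + ";");
            } else if (cp <= maxDirect) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }
}
```

With maxDirect = 127 this gives the "maximum interoperability" behaviour
described above; with maxDirect = 255 it matches what an ISO 8859/1 writer
could emit directly.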
Speaking of which, I'm attaching a patch I've had kicking around for a
while, which we use in production at my day job. John, I think you've
looked over this before, but I wanted to run it by the dev list one more
time before committing it to both the 1.2 and 2.0 branches.

Index: src/java/org/apache/xmlrpc/XmlWriter.java
===================================================================
RCS file: /home/cvs/ws-xmlrpc/src/java/org/apache/xmlrpc/XmlWriter.java,v
retrieving revision 1.6
diff -u -r1.6 XmlWriter.java
--- src/java/org/apache/xmlrpc/XmlWriter.java 21 Nov 2002 21:57:39 -0000 1.6
+++ src/java/org/apache/xmlrpc/XmlWriter.java 28 Apr 2005 22:45:07 -0000
@@ -312,6 +312,11 @@
throws XmlRpcException, IOException
{
int l = text.length ();
+ String enc = super.getEncoding();
+ boolean isUnicode = UTF8.equals(enc) || "UTF-16".equals(enc);
+ // ### TODO: Use a buffer rather than going character by
+ // ### character to scale better for large text sizes.
+ //char[] buf = new char[32];
for (int i = 0; i < l; i++)
{
char c = text.charAt (i);
@@ -332,16 +337,38 @@
write(AMPERSAND_ENTITY);
break;
default:
- if (c < 0x20 || c > 0xff)
+ if (c < 0x20 || c > 0x7f)
{
// Though the XML-RPC spec allows any ASCII
// characters except '<' and '&', the XML spec
// does not allow this range of characters,
// resulting in a parse error from most XML
- // parsers.
- throw new XmlRpcException(0, "Invalid character data " +
- "corresponding to XML entity &#" +
- String.valueOf((int) c) + ';');
+ // parsers. However, the XML spec does require
+ // XML parsers to support UTF-8 and UTF-16.
+ if (isUnicode)
+ {
+ if (c < 0x20)
+ {
+ // Entity escape the character.
+ write("&#");
+ // ### Do we really need the String conversion?
+ write(String.valueOf((int) c));
+ write(';');
+ }
+ else // c > 0x7f
+ {
+ // Write the character in our encoding.
+ write(new String(String.valueOf(c).getBytes(enc)));
+ }
+ }
+ else
+ {
+ throw new XmlRpcException(0, "Invalid character data "
+ + "corresponding to XML "
+ + "entity &#"
+ + String.valueOf((int) c)
+ + ';');
+ }
}
else
{