On 10 Dec 2003, at 18:58, Tom Bradford wrote:


Sorry, but this is untrue. The XML document prolog accepts an 'encoding' declaration for a reason: so that a parser can correctly read a document that uses a specific character encoding. Characters such as extended Latin, katakana, and Cyrillic can all be represented directly in UTF-8 without expressing them as entities.

The problem is that the Apache XML-RPC library, although it lets you force the encoding in the XML document prolog, has a bug in the XMLWriter class when it comes to characters above 0xFF. Anything outside the basic Latin set will trigger your error, even though the XML spec says those characters are perfectly legal in a document.


Nice to see you on the list, Tom.

What I would propose is that the default encoding remain ISO 8859-1 (so we don't break the existing implementations that are not UTF-x aware) and that *only* UTF-8 and UTF-16 be allowed as alternate encodings. You can't support arbitrary encodings unless you know how Unicode code points map onto the encoding's character set (i.e. you have to know which characters to escape).

We would also fix XMLWriter to do the proper escaping when using ISO 8859-1 and to do no escaping otherwise.
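
A minimal sketch of that escaping rule in Java follows. The class and method names (EncodingAwareEscaper, escape) are hypothetical and are not part of the Apache XML-RPC API. It assumes "escaping" means writing a numeric character reference for any code point above 0xFF, and that markup characters (<, >, &) are still escaped under every encoding.

```java
public final class EncodingAwareEscaper {
    private EncodingAwareEscaper() {}

    /**
     * Escapes text for an XML document written in the given encoding.
     * Only ISO-8859-1, UTF-8 and UTF-16 are accepted (assumption: this
     * mirrors the proposal to allow no other encodings).
     */
    public static String escape(String text, String encoding) {
        boolean latin1;
        if (encoding.equalsIgnoreCase("ISO-8859-1")) {
            latin1 = true;
        } else if (encoding.equalsIgnoreCase("UTF-8")
                || encoding.equalsIgnoreCase("UTF-16")) {
            latin1 = false;
        } else {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding);
        }

        StringBuilder out = new StringBuilder(text.length());
        // Walk by code point so surrogate pairs become a single reference.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (latin1 && cp > 0xFF) {
                // Not representable in ISO-8859-1: numeric character reference.
                out.append("&#").append(cp).append(';');
            } else {
                // Representable in the target encoding: write it as-is.
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a katakana character passed with "ISO-8859-1" comes out as a numeric character reference, while the same character passed with "UTF-8" is left untouched for the output stream to encode.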

Comments?



John Wilson
The Wilson Partnership
http://www.wilson.co.uk


