Sébastien Boisgérault schrieb: > I am trying to embed an *arbitrary* (unicode) strings inside > an XML document. Of course I'd like to be able to reconstruct > it later from the xml document ... If the naive way to do it does > not work, can anyone suggest a way to do it ?
XML does not support arbitrary Unicode characters; a few control characters are excluded. See the definiton of Char in http://www.w3.org/TR/2004/REC-xml-20040204 [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] Now, one might thing you could use a character reference (e.g. �) to refer to the "missing" characters, but this is not so: [66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ '; Well-formedness constraint: Legal Character Characters referred to using character references must match the production for Char. As others have explained, if you want to transmit arbitrary characters, you need to encode it as text in some way. One obvious solution would be to encode the Unicode data as UTF-8 first, and then encode the UTF-8 bytes using base64. The receiver of the XML document then must do the reverse. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list