Daniel Molina Wegener wrote:
> Stefan Behnel <stefan...@behnel.de> wrote:
>> Daniel Molina Wegener wrote:
>>> When the object is restored, by using pyxser.unserialize:
>>>
>>> pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")
>> But this is XML, right? What do you need to pass the encoding for at this
>> point?
>
> The user may want a different encoding, other than utf-8, it can
> be any encoding supported by libxml2.

I really meant what I wrote: this is XML. The encoding is well defined in the
XML declaration at the start of the document (and will default to UTF-8 if not
provided). Passing it externally will allow users to override that, which
doesn't make any sense at all.

> if the encodings are mixed inside Python byte strings, I think
> that there is no way to know which encoding are using them.

Correct.

> This may cause XML serialization errors

Yes, but only if you try to recode the strings (which, as I said, is a no-no).

>> One trick to do that is to decode the byte string as ISO-8859-1 and
>> serialise the result as a normal Unicode string. Then you can re-encode
>> the unicode string on input back to ISO-8859-1.
>
>> I choose ISO-8859-1 here because it has the well-defined side-effect of
>> mapping byte values directly to Unicode characters with an identical code
>> point value. So you do not risk any failures or data loss.
>
> Sure, but if there are Python byte strings (not Unicode strings), ones
> encoded in big5 and others in iso-8859-1 inside the object tree, the
> XML serialization would throw errors on the encoding conversion, by
> setting those bytes inside the document...

No, I really meant: decoding from ISO-8859-1 to Unicode, for all byte strings,
regardless of their encoding (since you can't even know if they represent
encoded text at all). So you get a Unicode string that you can serialise to
the target encoding, although it may result in character references (&#xyz;)
being output. But you won't get any errors, at least.

On the way in, you get a Unicode string again, which you can encode to
ISO-8859-1 to get the original byte string back.

Stefan
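
For illustration, here is a minimal sketch of the ISO-8859-1 round trip described above, written for Python 3. The helper names bytes_to_text and text_to_bytes are made up for this example and are not part of pyxser, and the sample byte strings are invented as well.

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode as ISO-8859-1: byte value N becomes code point U+00NN."""
    return raw.decode("iso-8859-1")


def text_to_bytes(text: str) -> bytes:
    """Inverse mapping: re-encode to ISO-8859-1 to recover the bytes."""
    return text.encode("iso-8859-1")


# Hypothetical byte strings with different (or no) text encodings.
samples = [
    b"\xa4\xa4\xa4\xe5",   # two characters encoded as Big5
    b"caf\xe9",            # "cafe" with e-acute, encoded as ISO-8859-1
    bytes(range(256)),     # arbitrary binary data, not text at all
]

for raw in samples:
    text = bytes_to_text(raw)
    # Every character has the same code point value as the source byte.
    assert all(ord(ch) == b for ch, b in zip(text, raw))
    # The round trip is lossless, whatever the bytes originally meant.
    assert text_to_bytes(text) == raw

# Serialising to a narrower target encoding yields character references
# instead of raising an error.
ascii_form = bytes_to_text(b"caf\xe9").encode("ascii", "xmlcharrefreplace")
print(ascii_form)
```

The sketch only covers the string mapping. Byte values that map to control characters not permitted in XML 1.0 would still need separate handling when the document is written out.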