[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

Fredrik Lundh Fri, 12 Mar 2010 01:06:08 -0800

Fredrik Lundh <fred...@effbot.org> added the comment:

"'None' has always been the documented default for the encoding parameter"


That's probably mostly by accident, at least in the original ET, but the 1.3 draft 
docs at effbot.org/elementtree do spell it out explicitly for the 'write' 
method:

   Output encoding. If omitted or set to None, defaults to US-ASCII.

Not sure I'd consider this text binding in itself, though (even if I'd argue 
that it's preferred to have the same interpretation of encoding everywhere).
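
A minimal sketch of that documented default, assuming a current CPython 3 
where the 'write' method accepts a binary file object; the element name and 
text are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing a non-ASCII character.
root = ET.Element("name")
root.text = "caf\u00e9"
tree = ET.ElementTree(root)

# Encoding omitted: the serialiser falls back to US-ASCII and emits
# non-ASCII characters as numeric character references.
default_buf = io.BytesIO()
tree.write(default_buf)
default_out = default_buf.getvalue()

# Explicit encoding: the character is written as encoded bytes instead.
utf8_buf = io.BytesIO()
tree.write(utf8_buf, encoding="utf-8")
utf8_out = utf8_buf.getvalue()

print(default_out)
print(utf8_out)
```

Both results are bytes, so the output stays well-formed XML regardless of 
the interpreter's default text encoding.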

"writing out the Unicode serialisation will result in an incorrect XML 
serialisation"

I think Guido meant the ElementTree.write method; is that broken too?

The file.write(et.tostring()) issue is probably my most pressing concern here; 
that's a common use case (e.g. when using "iterparse" to cut pieces from a big 
document), and the defaults were chosen to increase the chance that this 
automatically does the right thing for non-ASCII text even if the programmer 
never tests it.  In 3.X, that construct is suddenly dependent on the 
interpreter's default encoding.
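
To make the concern concrete, here is a small sketch contrasting a bytes 
serialisation with a text one. It assumes a current CPython 3, where 
tostring() returns bytes unless encoding="unicode" is passed; the 3.x 
behaviour under discussion in this issue differed. The element is 
hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical element with non-ASCII text.
elem = ET.Element("name")
elem.text = "caf\u00e9"

# Bytes result: pure ASCII with character references, safe to write to
# any file opened in binary mode.
as_bytes = ET.tostring(elem)

# Text result: contains the raw character, so writing it to a text-mode
# file depends on whatever encoding that file happens to use.
as_text = ET.tostring(elem, encoding="unicode")

# Simulate a text file whose encoding cannot represent the character.
try:
    as_text.encode("ascii")
    ascii_safe = True
except UnicodeEncodeError:
    ascii_safe = False

print(as_bytes, as_text, ascii_safe)
```

The bytes form survives untested non-ASCII input; the text form only works 
if the target file's encoding can represent every character.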

I think I'd prefer old "tostring" behaviour and a separate "tounicode" 
function, and I'm still not convinced that the latter is required for the XML 
use case (which implies that maybe it should live in lxml.html for the HTML 
case, even if it ends up calling the same internal implementation).

Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8047>
_______________________________________