Ben Spiller added the comment:
Thanks, that's really helpful.
Having thought about it some more, I think if possible it would be much better
to actually 'fix' the behaviour of the standard unicode<->str codecs (i.e. not
base64) rather than just documenting around it. The current behaviour is not
only confusing but leads to bugs that are very easy to miss, since the methods
work correctly as long as they are given only 7-bit ascii characters.
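To make the failure mode concrete, here's a minimal Python 2 interactive
session (the non-ascii bytes are just 'café' encoded as utf-8):

    >>> 'hello'.encode('utf-8')            # pure-ascii str: appears to work
    'hello'
    >>> u'caf\xe9'.encode('utf-8')         # unicode object: works as expected
    'caf\xc3\xa9'
    >>> 'caf\xc3\xa9'.encode('utf-8')      # str with non-ascii bytes: blows up
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)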
I had a poke around in the Python source but couldn't quite identify where it
happens - presumably somewhere in the str.encode('utf-8') implementation the
string is first implicitly "decoded", using the ascii codec. If that implicit
decode could be made to use the same encoding that was passed in (e.g. utf-8),
it would end up being a no-op, and there would be no unpleasant bugs that only
appear when the input includes non-ascii characters.
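In other words - and this is just my mental model of it, not the actual C
implementation - the current and proposed behaviours would be roughly:

    import sys

    s = 'caf\xc3\xa9'   # a str already holding utf-8 bytes

    # current behaviour: str.encode triggers an implicit decode via the
    # default codec (normally ascii), which is the step that actually fails
    try:
        s.decode(sys.getdefaultencoding()).encode('utf-8')
    except UnicodeDecodeError as e:
        print e   # 'ascii' codec can't decode byte 0xc3 in position 3 ...

    # proposed behaviour: decode with the codec the caller asked for, so
    # the round-trip is a no-op for input that's already valid utf-8
    assert s.decode('utf-8').encode('utf-8') == s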
It would also allow X.encode('utf-8') to be called successfully whether X is
already a str or a unicode object, which would save callers from having to
explicitly check which kind of string they've been passed (the workaround
sketched below).
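For what it's worth, the caller-side workaround needed today looks something
like this (Python 2; the name to_utf8 is mine, purely illustrative):

    def to_utf8(s):
        """Return utf-8 encoded bytes whether s is a unicode object or a
        str (a str is assumed to already hold utf-8 bytes, which is what
        the proposed no-op behaviour would give us for free)."""
        if isinstance(s, unicode):
            return s.encode('utf-8')
        return s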
Is anyone able to look into the code to see where this would need to be fixed,
and how difficult it would be? I have a feeling that once the relevant line is
located it might be quite a straightforward fix.
Many thanks
----------
components: +Interpreter Core -Documentation
title: doc for unicode.decode and str.encode is unnecessarily confusing ->
unicode.decode and str.encode are unnecessarily confusing for non-ascii
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26369>
_______________________________________