Ben Spiller added the comment:

Thanks, that's really helpful.

Having thought about it some more, I think it would be much better, if 
possible, to actually 'fix' the behaviour of the standard unicode<->str 
codecs (i.e. not base64) rather than just documenting around it. The current 
behaviour is not only confusing but leads to bugs that are very easy to miss, 
since the methods work correctly as long as the input is 7-bit ASCII.
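
To make the failure mode concrete, here's a minimal Python 2 sketch (the 
byte values are just the UTF-8 encoding of u'café', picked for illustration):

    # str.encode() implicitly decodes the byte string with the default
    # (ASCII) codec first, so code that passes testing with plain ASCII
    # input fails as soon as non-ASCII input arrives.
    print('hello'.encode('utf-8'))    # fine: pure ASCII round-trips

    utf8_bytes = 'caf\xc3\xa9'        # the UTF-8 bytes of u'café'
    try:
        utf8_bytes.encode('utf-8')    # implicit .decode('ascii') runs first
    except UnicodeDecodeError as e:
        print('UnicodeDecodeError: %s' % e)  # can't decode byte 0xc3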

I had a poke around in the Python source but couldn't quite identify where 
it's happening - presumably somewhere in the str.encode('utf-8') 
implementation the string is first "decoded", and that implicit decode uses 
the ASCII codec. If it could be made to use the same encoding that was passed 
in (e.g. UTF-8), the decode-then-encode would end up being a no-op for input 
that's already valid in that encoding, and there would be no unpleasant bugs 
that only appear when the input includes non-ASCII characters.
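
In other words, the current behaviour appears to be equivalent to the first 
helper below, and what I'm proposing is the second; both function names are 
hypothetical, used purely to illustrate the difference:

    import sys

    def encode_as_it_behaves_today(byte_str, encoding):
        # Model of the apparent current behaviour: first decode with the
        # interpreter's default codec (normally ASCII), then encode.
        return byte_str.decode(sys.getdefaultencoding()).encode(encoding)

    def encode_as_proposed(byte_str, encoding):
        # Proposed: use the requested codec for the implicit decode too,
        # so a str already valid in `encoding` round-trips as a no-op.
        return byte_str.decode(encoding).encode(encoding)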

It would also allow X.encode('utf-8') to be called successfully whether X is 
already a str or a unicode object, which would save callers from having to 
explicitly check what kind of string they've been passed, as in the helper 
sketched below.
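
For comparison, this is the kind of type-checking boilerplate (a hypothetical 
to_utf8 helper, not anything in the stdlib) that callers currently have to 
write to stay safe:

    def to_utf8(s):
        # Workaround needed today: only unicode objects can safely be
        # encoded; byte strings must be passed through untouched.
        if isinstance(s, unicode):
            return s.encode('utf-8')
        return s  # already a str; assumed to hold valid UTF-8 bytes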

Is anyone able to look into the code to see where this would need to be fixed 
and how difficult it would be? I have a feeling that once the relevant line 
is located it might be quite a straightforward fix.

Many thanks

----------
components: +Interpreter Core -Documentation
title: doc for unicode.decode and str.encode is unnecessarily confusing -> unicode.decode and str.encode are unnecessarily confusing for non-ascii

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________