[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

Ben Spiller Fri, 20 May 2016 02:34:54 -0700

Ben Spiller added the comment:

Thanks for considering this, anyway. I'll admit I'm disappointed we couldn't 
fix this on the 2.7 train, as to me fixing a method that takes an 
errors='ignore' argument and then throws an exception anyway seems a little 
more like a bug than a feature (and changing it would likely not affect 
behaviour in any existing non-broken programs), but if that's the decision then 
fine. Of course I'm aware (as I mentioned earlier on the thread) that the 
radically different unicode handling in python 3 solves this entirely and only 
wish it was practical to move our existing (enormous) codebase and customers 
over to it, but we're stuck with Python 2.7 - I believe lots of people are in 
the same situation unfortunately.


As Josh suggested, perhaps we can at least add something to the doc for the 
str/unicode encode and decode methods so users are aware of the behaviour 
without trial and error. I'll update the component of this bug to reflect it's 
now considered a doc issue. 

Based on the input from Terry, and what seems to be the key info that would 
have been helpful to me and those who are hitting the same issues for the first 
time, I'd propose the following text (feel free to adjust as you see fit):

For encode:
"For most encodings, the return type is a byte str regardless of whether it is 
called on a str or unicode object. For example, call encode on a unicode object 
with "utf-8" to return a byte str object, or call encode on a str object with 
"base64" to return a base64-encoded str object.

It is _not_ recommended to call this method on "str" objects when using 
codecs such as utf-8 that convert between str and unicode objects, as any 
characters not supported by python's default encoding (usually 7-bit ascii) 
will result in a UnicodeDecodeError exception, even if errors='ignore' was 
specified. For such conversions the str.decode and unicode.encode methods 
should be used. If you need to produce an encoded version of a string that 
could be either a str or unicode object, only call the encode() method after 
checking it is a unicode object not a str object, using isinstance(s, unicode)."
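
As a rough sketch of the isinstance check described above: the helper name
to_bytes and the TEXT_TYPE/BYTES_TYPE aliases are hypothetical and not part of
the proposed doc text. It uses type(u"") rather than naming unicode directly,
so that the same snippet also runs on Python 3.

```python
TEXT_TYPE = type(u"")   # unicode on Python 2.7, str on Python 3
BYTES_TYPE = type(b"")  # str on Python 2.7, bytes on Python 3


def to_bytes(s, encoding="utf-8", errors="strict"):
    """Hypothetical helper: encode text, pass byte strings through unchanged."""
    if isinstance(s, TEXT_TYPE):
        # Text object: encoding is the supported direction.
        return s.encode(encoding, errors)
    if isinstance(s, BYTES_TYPE):
        # Already a byte string: calling .encode() here is what would
        # trigger the implicit default-encoding decode on 2.7, so skip it.
        return s
    raise TypeError("expected a str or unicode object, got %r" % type(s))


encoded_from_text = to_bytes(u"caf\xe9")        # text in, bytes out
encoded_from_bytes = to_bytes(b"caf\xc3\xa9")   # bytes in, returned as-is
```

Both calls yield the same UTF-8 byte string, and neither can raise
UnicodeDecodeError, because encode() is never called on a byte string.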

and for decode:
"The return type may be either str or unicode, depending on which encoding is 
used and whether the method is called on a str or unicode object. For example, 
call decode on a str object with "utf-8" to return a unicode object, or call 
decode on a unicode or str object with "base64" to return a base64-decoded str 
object.

It is _not_ recommended to call this method on "unicode" objects when using 
codecs such as utf-8 that convert between str and unicode objects, as any 
characters not supported by python's default encoding (usually 7-bit ascii) 
will result in a UnicodeEncodeError exception, even if errors='ignore' was 
specified. For such conversions the str.decode and unicode.encode methods 
should be used. If you need to produce a decoded version of a string that could 
be either a str or unicode object, only call the decode() method after checking 
it is a str object not a unicode object, using isinstance(s, str)."
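
The decode side can be sketched the same way. The names to_text and
implicit_encode_succeeds are hypothetical. The second function only spells out,
as an assumption about the 2.7 behaviour discussed in this issue, the hidden
strict encode with the default encoding that happens before a str-to-unicode
codec runs on a unicode object; the caller's errors argument is not applied to
that step.

```python
TEXT_TYPE = type(u"")   # unicode on Python 2.7, str on Python 3
BYTES_TYPE = type(b"")  # str on Python 2.7, bytes on Python 3


def to_text(s, encoding="utf-8", errors="strict"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(s, BYTES_TYPE):
        # Byte string: decoding is the supported direction.
        return s.decode(encoding, errors)
    if isinstance(s, TEXT_TYPE):
        # Already text: nothing to decode.
        return s
    raise TypeError("expected a str or unicode object, got %r" % type(s))


def implicit_encode_succeeds(text, default_encoding="ascii"):
    """Mimic the hidden first step; it is always strict (assumed model)."""
    try:
        text.encode(default_encoding)
    except UnicodeEncodeError:
        return False
    return True


decoded = to_text(b"caf\xc3\xa9")                  # bytes in, text out
lenient = to_text(b"caf\xff", errors="ignore")     # errors applies here
hidden_step_ok = implicit_encode_succeeds(u"caf\xe9")
```

Here errors='ignore' takes effect in to_text because it is passed to the real
decode step, whereas the simulated hidden step fails on the non-ascii
character regardless of what the caller asked for.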

----------
components: +Documentation -Interpreter Core

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________