[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

Marc-Andre Lemburg Fri, 20 May 2016 01:01:38 -0700

Marc-Andre Lemburg added the comment:

Ben, the methods on strings and Unicode objects in Python 2.x are direct 
interfaces to the underlying codecs. The codecs can handle any number of input 
and output types, so there are some which only work on 8-bit strings (bytes) 
and others which take Unicode as input.
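
To illustrate the different input and output types, here is a small sketch 
using codecs.encode(). It is written for Python 3, where these codecs are 
still available through the codecs module; the variable names are only 
illustrative.

```python
import codecs

# bytes -> bytes: the hex codec only accepts 8-bit strings (bytes).
hex_bytes = codecs.encode(b"hi", "hex_codec")

# text -> text: rot_13 takes Unicode text and returns Unicode text.
rotated = codecs.encode("hi", "rot_13")

# text -> bytes: UTF-8 encoding takes Unicode text and returns bytes.
utf8_bytes = codecs.encode("caf\u00e9", "utf-8")

print(hex_bytes, rotated, utf8_bytes)
```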


As a result, you sometimes see errors caused by the implicit conversion of an 
8-bit string to Unicode (in the case where the codec expects Unicode input).

As an example, take the UTF-8 codec. This expects Unicode input when encoding, 
so when you pass in an 8-bit string, Python will first convert it to Unicode 
using the default encoding (which is normally set to 'ascii') and then apply 
the codec operation.

When the 8-bit string is plain ASCII this works great. If not, chances are high 
that you'll run into a Unicode error.
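
The two-step behaviour (implicit conversion with the default encoding, then 
the codec operation) can be modelled roughly as follows. This is a sketch 
written for Python 3, not the actual Python 2 implementation; the helper 
py2_style_encode and its default_encoding parameter are hypothetical names 
made up for illustration.

```python
def py2_style_encode(data, encoding, default_encoding="ascii"):
    """Hypothetical model of Python 2's str.encode() on an 8-bit string."""
    # Step 1: implicit conversion of the 8-bit string to Unicode,
    # using the default encoding.
    text = data.decode(default_encoding)
    # Step 2: the codec operation that was actually requested.
    return text.encode(encoding)


# Plain ASCII input passes through the implicit conversion unharmed.
ascii_result = py2_style_encode(b"abc", "utf-8")

# Non-ASCII input fails in step 1, before the UTF-8 codec is even used.
try:
    py2_style_encode(b"caf\xc3\xa9", "utf-8")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

print(ascii_result, type(failure).__name__)
```

Note that the error is a decode error raised by the 'ascii' codec, even 
though the caller asked for an encode operation with UTF-8.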

Now, in Python 2.x you can change the default encoding. You can either make 
this work by assuming that all your 8-bit strings are UTF-8 (set it to 'utf-8' 
in sitecustomize.py), or disable the automatic conversion altogether by 
setting the default encoding to 'undefined', a codec created specifically for 
this purpose. The latter raises an exception on any attempt to convert an 
8-bit string to Unicode - similar to what Python 3 does, except that the error 
type is different.
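
The effect of the two settings can be sketched like this. The code runs on 
Python 3 and only imitates the implicit conversion; implicit_to_unicode is a 
hypothetical stand-in, not a real Python 2 function. The codec that always 
raises is registered in the standard library under the name 'undefined'.

```python
def implicit_to_unicode(data, default_encoding):
    """Hypothetical stand-in for Python 2's automatic str -> unicode step."""
    return data.decode(default_encoding)


raw = b"caf\xc3\xa9"

# Default encoding 'utf-8': UTF-8 encoded 8-bit strings convert cleanly.
as_utf8 = implicit_to_unicode(raw, "utf-8")

# Default encoding 'undefined': every conversion raises, even for ASCII.
try:
    implicit_to_unicode(b"abc", "undefined")
    undefined_error = None
except UnicodeError as exc:
    undefined_error = exc

# Python 3 for comparison: mixing text and bytes raises TypeError instead.
text, data = "abc", b"abc"
try:
    text + data
    py3_error = None
except TypeError as exc:
    py3_error = exc

print(as_utf8, type(undefined_error).__name__, type(py3_error).__name__)
```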

Hope that helps.

----------
nosy: +lemburg

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________