New submission from Ben Spiller:
It's well known that lots of people struggle to write correct programs using
non-ASCII strings in Python 2.x, but I think one of the main reasons for this
could be fixed very easily with a small addition to the documentation for
str.encode and unicode.decode, which is currently quite vague.
The decode/encode methods make most sense when called as unicode.encode(), to
produce a byte string from a unicode string, or as str.decode(), to produce a
unicode object from a byte string.
However, the additional presence of the opposite methods, str.encode() and
unicode.decode(), is quite confusing and a frequent source of errors. For
example, calling str.encode('utf-8') first DECODES the str object (which might
already be in UTF-8) to a unicode string **using the default encoding of
"ascii"** (!) before ENCODING to a UTF-8 byte str as requested. This of course
fails at the first stage with the classic error "UnicodeDecodeError: 'ascii'
codec can't decode byte" if any non-ASCII characters are present. It's
unfortunate that this initial decode/encode stage ignores both the "encoding"
argument (which is used only for the subsequent encode/decode) and the
"errors" argument (commonly used when the programmer is happy with a
best-effort conversion, e.g. for logging purposes).
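For example, here is a minimal Python 2 sketch of the behaviour described
above (the byte values are only illustrative):

    # Python 2.x
    b = u'caf\xe9'.encode('utf-8')  # b is a UTF-8 byte str: 'caf\xc3\xa9'
    b.decode('utf-8')               # fine: returns u'caf\xe9'
    b.encode('utf-8')               # raises UnicodeDecodeError: 'ascii' codec
                                    # can't decode byte 0xc3 ..., because the
                                    # implicit decode step uses 'ascii', not
                                    # the 'utf-8' that was passed in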
Anyway, given this behaviour, a lot of time would be saved by a simple sentence
in the docs for str.encode()/unicode.decode() essentially warning people that
those methods aren't that useful and that they probably really intended to use
str.decode()/unicode.encode() - the current doc gives absolutely no clue about
this extra stage, which ignores the input arguments and uses 'ascii' and
'strict'. It might also be worth stating in the documentation that the pattern
(u.encode(encoding) if isinstance(u, unicode) else u) can be helpful for cases
where you unavoidably have to deal with both kinds of input, since calling
str.encode is such a bad idea.
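Wrapped up as a helper, that pattern might look like this (a hypothetical
function, just for illustration - to_bytes is not part of the stdlib):

    def to_bytes(value, encoding='utf-8'):
        # Encode only genuine unicode objects; pass byte strings through
        # untouched, avoiding str.encode and its implicit ascii decode.
        return value.encode(encoding) if isinstance(value, unicode) else value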
In an ideal world I'd love to see the implementation of
str.encode/unicode.decode changed to be more useful (i.e. instead of using
ascii, it would be more logical and useful to use the passed-in encoding - and
the passed-in 'errors' value - to perform the initial decode/encode). I wasn't
sure whether that change would be accepted, so for now I'm proposing better
documentation of the existing behaviour as a second-best.
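A rough sketch of what that proposed behaviour could look like (this is not
existing CPython code, just an illustration of the idea):

    def str_encode_proposed(s, encoding='utf-8', errors='strict'):
        # Proposed: honour the caller's encoding/errors for the implicit
        # decode step too, instead of the hard-coded 'ascii'/'strict'.
        if isinstance(s, str):
            s = s.decode(encoding, errors)
        return s.encode(encoding, errors)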
----------
assignee: docs@python
components: Documentation
messages: 260359
nosy: benspiller, docs@python
priority: normal
severity: normal
status: open
title: doc for unicode.decode and str.encode is unnecessarily confusing
type: behavior
versions: Python 2.7
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26369>
_______________________________________