New submission from Ben Spiller:
It's well known that lots of people struggle to write correct programs using
non-ASCII strings in Python 2.x, but I think one of the main reasons for this
could be fixed very easily with a small addition to the documentation for
str.encode and unicode.decode, which is currently quite vague.
The decode/encode methods make most sense when called as unicode.encode(), to
produce a byte string from a unicode string, or as str.decode(), to produce a
unicode object from a byte string.
However, the additional presence of the opposite methods, str.encode() and
unicode.decode(), is quite confusing and a frequent source of errors. For
example, calling str.encode('utf-8') first DECODES the str object (which might
already be in UTF-8) to a unicode string **using the default encoding of
"ascii"** (!) before ENCODING to a UTF-8 byte str as requested. This of course
fails at the first stage with the classic error "UnicodeDecodeError: 'ascii'
codec can't decode byte" if any non-ASCII characters are present. It's
unfortunate that this initial decode/encode stage ignores both the "encoding"
argument (which is used only for the subsequent encode/decode) and the
"errors" argument (commonly used when the programmer is happy with a
best-effort conversion, e.g. for logging purposes).
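For example, here is a minimal Python 2 sketch of the behaviour described
above (the byte values are only illustrative):

    # Python 2.x
    b = u'caf\xe9'.encode('utf-8')  # b is a UTF-8 byte str: 'caf\xc3\xa9'
    b.decode('utf-8')               # fine: returns u'caf\xe9'
    b.encode('utf-8')               # raises UnicodeDecodeError: 'ascii' codec
                                    # can't decode byte 0xc3 ..., because the
                                    # implicit decode step uses 'ascii', not
                                    # the 'utf-8' that was passed in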
Anyway, given this behaviour, a lot of time would be saved by a simple sentence
in the docs for str.encode()/unicode.decode() essentially warning people that
those methods aren't that useful and that they probably really intended to use
str.decode()/unicode.encode() - the current doc gives absolutely no clue about
this extra stage, which ignores the input arguments and uses 'ascii' and
'strict'. It might also be worth stating in the documentation that the pattern
(u.encode(encoding) if isinstance(u, unicode) else u) can be helpful for cases
where you unavoidably have to deal with both kinds of input, since calling
str.encode is such a bad idea.
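Wrapped up as a helper, that pattern might look like this (a hypothetical
function, just for illustration - to_bytes is not part of the stdlib):

    def to_bytes(value, encoding='utf-8'):
        # Encode only genuine unicode objects; pass byte strings through
        # untouched, avoiding str.encode and its implicit ascii decode.
        return value.encode(encoding) if isinstance(value, unicode) else value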
In an ideal world I'd love to see the implementation of
str.encode/unicode.decode changed to be more useful (i.e. instead of using
ascii, it would be more logical and useful to use the passed-in encoding - and
the passed-in 'errors' value - to perform the initial decode/encode). I wasn't
sure whether that change would be accepted, so for now I'm proposing better
documentation of the existing behaviour as a second-best.
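A rough sketch of what that proposed behaviour could look like (this is not
existing CPython code, just an illustration of the idea):

    def str_encode_proposed(s, encoding='utf-8', errors='strict'):
        # Proposed: honour the caller's encoding/errors for the implicit
        # decode step too, instead of the hard-coded 'ascii'/'strict'.
        if isinstance(s, str):
            s = s.decode(encoding, errors)
        return s.encode(encoding, errors)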
----------
assignee: docs@python
components: Documentation
messages: 260359
nosy: benspiller, docs@python
priority: normal
severity: normal
status: open
title: doc for unicode.decode and str.encode is unnecessarily confusing
type: behavior
versions: Python 2.7
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26369>
_______________________________________