[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

Josh Rosenberg Thu, 19 May 2016 11:38:29 -0700

Josh Rosenberg added the comment:

Agree with Steven; the whole reason Python 3 changed from unicode and str to 
str and bytes was that having Py2 str be text sometimes and binary data at 
other times is confusing. The existing behavior can't change in Py2 in any 
meaningful way without breaking existing code, introducing special cases for 
text->text encodings (which Python 3 supports only through the codecs 
module), or behaving in non-obvious ways in corner cases. Silently treating 
str.encode("utf-8") as "decode as UTF-8 and throw away the result to verify 
that it's already UTF-8 bytes" is not particularly intuitive either.
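
To make the hidden step concrete, here is a rough Python 3 emulation of what 
Py2 str.encode does to a byte string. The helper name py2_style_encode is 
hypothetical, and "ascii" is assumed as the default codec (the usual Py2 value 
of sys.getdefaultencoding()); this is a sketch of the behavior, not the 
interpreter's actual code path:

```python
def py2_style_encode(data, encoding):
    """Emulate Py2 str.encode: implicit decode, then the requested encode."""
    # Hidden step: Py2 first decodes the byte string with the default codec
    # (assumed to be ASCII here), which fails on any non-ASCII byte.
    text = data.decode("ascii")
    return text.encode(encoding)


# Pure-ASCII input round-trips without complaint.
print(py2_style_encode(b"abc", "utf-8"))

# Bytes that are already valid UTF-8 still fail, and the error is a *decode*
# error even though the caller asked to *encode*.
try:
    py2_style_encode(b"caf\xc3\xa9", "utf-8")
except UnicodeDecodeError as exc:
    print("failed:", exc)
```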


It does seem like a doc fix would be useful though; right now, we have only 
"String methods" documented, with no distinction between str and unicode. It 
might be helpful to explicitly deprecate str.encode and unicode.decode, with 
a note that while it's meaningful to use these methods in Python 2 for 
text<->text encoding/decoding, their counterparts (bytes.encode and 
str.decode) don't exist at all in Python 3.
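
As a quick illustration of the Python 3 side (a sketch to run under Python 3, 
using only the standard library), the text type can only encode, the binary 
type can only decode, and text->text codecs such as rot_13 are reached through 
the codecs module:

```python
import codecs

# In Python 3, str has no decode method and bytes has no encode method.
print(hasattr(str, "decode"))    # False
print(hasattr(bytes, "encode"))  # False

# Text->text codecs are still available, but only via the codecs module.
print(codecs.encode("hello", "rot_13"))  # uryyb
```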

Otherwise, yes, if you want consistent text/binary types, that's what Python 3 
is for. Python 2 has tons of flaws when it comes to handling Unicode (e.g. the 
csv module), and fixing any given single problem (creating backward 
compatibility headaches in the process) is not worth the trouble.

If you're concerned about excessive boilerplate, just write a function (or a 
type) that allows you to perform the tests/conversions you care about as a 
single call. For example, the following seems like it achieves your objectives 
(one-line usage; handles str by verifying that it's legal in the provided 
encoding in strict mode, dropping/replacing characters in ignore/replace mode, 
etc.):

import sys

def basestringencode(s, encoding=sys.getdefaultencoding(), errors="strict"):
    # Python 2 only: str is the byte-string type here
    if isinstance(s, str):
        # Decode with provided rules, so a str with illegal characters
        # raises exception, replaces, ignores, etc. per arguments
        s = s.decode(encoding, errors)
    # s is unicode at this point, so this encode has no implicit decode
    return s.encode(encoding, errors)

If you don't want to see UnicodeDecodeError, either pass 'ignore' for errors 
or wrap the s.decode step in a try/except and raise a different exception 
type.
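
The try/except option could look something like the following sketch. The 
names encode_checked and NotValidEncodingError are made up for illustration, 
and the default of "utf-8" is an assumption; it checks for bytes rather than 
str so that the same code runs on Python 2 (where bytes is an alias of str) 
and Python 3:

```python
class NotValidEncodingError(ValueError):
    """Hypothetical exception raised in place of UnicodeDecodeError."""


def encode_checked(s, encoding="utf-8", errors="strict"):
    if isinstance(s, bytes):  # bytes is an alias of str on Python 2
        try:
            # Validate the byte string by decoding it with the given rules.
            s = s.decode(encoding, errors)
        except UnicodeDecodeError as exc:
            # Swap the confusing decode error for a domain-specific one.
            raise NotValidEncodingError(
                "input is not valid %s: %s" % (encoding, exc))
    return s.encode(encoding, errors)


# Text input is encoded; byte input is validated and returned re-encoded.
print(encode_checked(u"caf\xe9"))
print(encode_checked(b"caf\xc3\xa9"))
```

Callers then catch NotValidEncodingError and never see a decode error from 
what looks like an encode call.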

The biggest change I could see happening code-wise would be a textual change 
to the UnicodeDecodeError that str.encode raises, so str.encode specifically 
replaces the default error message (but not the type, for back compat reasons) 
with something like "str.encode cannot perform implicit decode with 
sys.getdefaultencoding(); use .encode only with unicode objects".
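
A user-level approximation of that idea, for illustration only: the sketch 
below re-raises the same exception type with a replaced reason string. The 
helper encode_with_hint and its default="ascii" parameter are hypothetical, 
and this is not a proposed patch to the interpreter:

```python
HINT = ("str.encode cannot perform implicit decode with "
        "sys.getdefaultencoding(); use .encode only with unicode objects")


def encode_with_hint(data, encoding, default="ascii"):
    try:
        # Stand-in for the implicit decode step.
        text = data.decode(default)
    except UnicodeDecodeError as exc:
        # Same exception type and position info; only the reason changes,
        # so existing "except UnicodeDecodeError" handlers keep working.
        raise UnicodeDecodeError(
            exc.encoding, exc.object, exc.start, exc.end, HINT)
    return text.encode(encoding)


try:
    encode_with_hint(b"caf\xc3\xa9", "utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)
```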

----------
nosy: +josh.r

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26369>
_______________________________________