Josh Rosenberg added the comment:
Agree with Steven; the whole reason Python 3 changed from unicode and str to
str and bytes is that having Py2 str be text sometimes and binary data at
other times is confusing. The existing behavior can't change in Py2 in any
meaningful way without breaking existing code, introducing special cases for
text->text encodings (where Python 3 supports them using the codecs module
only), behaving in non-obvious ways in corner cases, etc. Silently treating
str.encode("utf-8") as meaning "decode as UTF-8 and throw away the result to
verify that it's already UTF-8 bytes" is not particularly intuitive either.
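
For reference, a minimal sketch of what "text->text encodings using the codecs module only" looks like in Python 3 (3.4 or later is assumed here). The helper names rot13 and str_encode_accepts_rot13 are made up for illustration; codecs.encode and the rot_13 codec are standard library.

```python
import codecs


def rot13(text):
    """Apply the text->text rot_13 codec through the codecs module."""
    return codecs.encode(text, "rot_13")


def str_encode_accepts_rot13():
    """Report whether str.encode accepts a text->text codec."""
    try:
        "Hello".encode("rot_13")
    except LookupError:
        # Python 3 rejects non-text encodings here and points the caller
        # at codecs.encode() instead.
        return False
    return True


print(rot13("Hello"))              # Uryyb
print(str_encode_accepts_rot13())  # False
```

So in Python 3 the str method is reserved for text->bytes, and anything else has to go through codecs explicitly.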
It does seem like a doc fix would be useful though; right now, we have only
"String methods" documented, with no distinction between str and unicode. It
might be helpful to explicitly deprecate encode on str objects and decode on
unicode objects, with a note that while it's meaningful to use these methods in
Python 2 for text<->text encoding/decoding, the corresponding methods
(bytes.encode and str.decode) don't exist at all in Python 3.
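
A quick way to see the Python 3 side of that note, as a small sketch run under Python 3 (the variable names are only illustrative):

```python
# Python 3: text is str, binary data is bytes, and each type only
# carries the conversion method that makes sense for it.
text = "caf\u00e9"
data = text.encode("utf-8")       # str -> bytes
roundtrip = data.decode("utf-8")  # bytes -> str

bytes_has_encode = hasattr(bytes, "encode")
str_has_decode = hasattr(str, "decode")

print(type(data).__name__, type(roundtrip).__name__)  # bytes str
print(bytes_has_encode, str_has_decode)               # False False
```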
Otherwise, yes, if you want consistent text/binary types, that's what Python 3
is for. Python 2 has tons of flaws when it comes to handling Unicode (e.g. the
csv module), and fixing any given single problem (creating backward compatibility
headaches in the process) is not worth the trouble.
If you're concerned about excessive boilerplate, just write a function (or a
type) that allows you to perform the tests/conversions you care about as a
single call. For example, the following seems like it achieves your objectives
(one-line usage, handles str by verifying that it's legal in the provided
encoding in strict mode, dropping/replacing characters in ignore/replace mode,
etc.):
import sys

def basestringencode(s, encoding=sys.getdefaultencoding(), errors="strict"):
    if isinstance(s, str):
        # Python 2 str (bytes): decode with the provided rules first, so a str
        # with illegal characters raises, replaces, ignores, etc. per arguments
        s = s.decode(encoding, errors)
    # s is now unicode either way; encode it with the same rules
    return s.encode(encoding, errors)
If you don't want to see UnicodeDecodeError, you can either pass 'ignore' for
errors, or wrap the s.decode step in a try/except and raise a different
exception type.
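
As a rough sketch of the try/except variant, here is the same idea adapted to Python 3, where the "maybe already encoded" input is bytes rather than Py2 str. The names encode_checked and NotEncodableError are hypothetical, and the "utf-8" default is an assumption, not something the function above uses.

```python
class NotEncodableError(ValueError):
    """Hypothetical exception raised instead of UnicodeDecodeError."""


def encode_checked(s, encoding="utf-8", errors="strict"):
    """Return s as bytes in `encoding`, validating bytes input first."""
    if isinstance(s, bytes):
        try:
            # Decode with the caller's rules so invalid input raises,
            # is dropped, or is replaced according to `errors`.
            s = s.decode(encoding, errors)
        except UnicodeDecodeError as exc:
            raise NotEncodableError(
                "input bytes are not valid %s" % encoding
            ) from exc
    return s.encode(encoding, errors)


print(encode_checked("caf\u00e9"))               # b'caf\xc3\xa9'
print(encode_checked(b"\xff", errors="ignore"))  # b''
```

Chaining with "from exc" keeps the original UnicodeDecodeError available on the new exception for debugging.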
The biggest change I could see happening code-wise would be a textual change to
the UnicodeDecodeError that str.encode raises, so that str.encode specifically
replaces the default error message (but not the type, for back compat reasons)
with something like "str.encode cannot perform implicit decode with
sys.getdefaultencoding(); use .encode only with unicode objects".
----------
nosy: +josh.r
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26369>
_______________________________________