[issue24019] str/unicode encoding kwarg causes exceptions

Mahmoud Hashemi Mon, 20 Apr 2015 21:57:20 -0700

New submission from Mahmoud Hashemi:

The encoding keyword argument to the Python 3 str() and Python 2 unicode() 
constructors is excessively constraining to the practical use of these core 
types.


Looking at common usage, both these constructors' primary mode is to convert 
various objects into text:

>>> str(2)
'2'

But adding an encoding yields:

>>> str(2, encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to str: need bytes, bytearray or buffer-like object, int 
found

While the error message is fine for an experienced developer, I would like to 
raise the question: is it necessary at all? Even harmlessly getting a str from 
a str is punished, but leaving off encoding is fine again:

>>> str('hi', encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding str is not supported
>>> str('hi')
'hi'

Merging and simplifying the two modes of these constructors would yield much 
more predictable results for experienced and beginning Pythonists alike. 
Basically, the encoding argument should be ignored if the argument is already a 
unicode/str instance, or if it is a non-string object. It should only be 
consulted if the primary argument is a bytestring. Bytestrings already have a 
.decode() method, so another, more obscure spelling of it isn't necessary.
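
To make the proposed rule concrete, here is a minimal Python 3 sketch. The 
name to_text and its utf-8 default are assumptions for illustration only; 
this is not an existing API and not a patch to str() itself:

```python
def to_text(obj, encoding='utf-8', errors='strict'):
    """Hypothetical helper sketching the proposed str() semantics."""
    if isinstance(obj, (bytes, bytearray, memoryview)):
        # Binary data: the only case where encoding is consulted.
        return str(obj, encoding=encoding, errors=errors)
    # Text and non-string objects: encoding is ignored rather than
    # raising TypeError as str(obj, encoding=...) does today.
    return str(obj)
```

With this helper, to_text(2) and to_text('hi') return text instead of raising, 
while to_text(b'hi') decodes as str(b'hi', encoding='utf-8') does now.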

Furthermore, despite the core nature and widespread usage of these types, 
changing this behavior should break very little existing code or established 
understanding. unicode() and str() will simply behave as expected more often, 
returning text versions of the arguments passed to them. 

Appendix: To demonstrate the expected behavior of the proposed unicode/str, 
here is a code snippet we've employed to sanely and safely get a text version 
of an arbitrary object:

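def to_unicode(obj, encoding='utf8', errors='strict'):
    # Python 2 code. The encoding default should look at sys's value.
    try:
        # Succeeds for unicode, ASCII-only bytestrings, and most other objects.
        return unicode(obj)
    except UnicodeDecodeError:
        # Typically a non-ASCII bytestring: decode it with the given encoding.
        return unicode(obj, encoding=encoding, errors=errors)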

After many years of writing Python and teaching it to developers of all 
experience levels, I firmly believe that this is the right interaction pattern 
for Python's core text type. I'm also happy to expand on this issue, turn it 
into a PEP, or submit a patch if there is interest.

----------
components: Unicode
messages: 241699
nosy: ezio.melotti, haypo, mahmoud
priority: normal
severity: normal
status: open
title: str/unicode encoding kwarg causes exceptions
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24019>
_______________________________________