Ben Spiller added the comment:

btw, if anyone can find the place in the code (sorry, I tried and failed!) where 
str.encode('utf-8', errors=X) results in an implicit call to the equivalent of 
decode(defaultencoding, errors='strict') (as suggested by the exception 
message), I think it'll be easier to discuss the details of fixing this.

Thanks for your reply - yes, I'm aware that theoretically you _could_ globally 
change Python's default encoding from ascii, but the prevailing view I've heard 
from Python developers is that changing it is not a good idea and may cause 
lots of library code to break. It's also probably not a good idea for 
individual libraries or modules to be changing global state that affects the 
entire Python invocation, and it would be nice to find a less fragile, more 
out-of-the-box solution. You may well need different encodings (not just 
utf-8) in different parts of your program, so changing the globally-defined 
default encoding doesn't seem right - especially for a method like str.encode 
that already takes an 'encoding' argument (currently used only for the 
encoding step, not the implicit decoding step). 
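
For concreteness, the global change I'm arguing against is the well-known (and 
widely discouraged) hack below; I show it only to illustrate how it mutates 
interpreter-wide state:

    import sys
    reload(sys)                      # site.py deletes setdefaultencoding at startup
    sys.setdefaultencoding('utf-8')  # from now on *every* implicit decode uses utf-8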

I do think there's a strong case for changing the str.encode (and also 
unicode.decode) behaviour so that str.encode('utf-8') behaves the same whether 
it's given ascii or non-ascii characters, and also behaves like 
unicode.encode('utf-8'). Let me try to persuade you... :)

First, to address the point you made:

> If str.encode() raises a decoding exception, this is a programming bug. It 
> would be bad to hide it.

I totally agree with the general principle of not hiding programming bugs. 
However, if calling str.encode for codecs like utf-8 (let's ignore base64 for 
now, which is a very different beast) were *consistently* treated as a 
'programming bug' by Python and always resulted in an exception, that would be 
ok (suboptimal usability imho, but still ok), since programmers would quickly 
spot the problem and fix it. But that's not what happens - it *silently works* 
(is a no-op) as long as you happen to be using ascii characters, so this 
so-called 'programming bug' will go unnoticed by most programmers (and authors 
of third-party library code you might be relying on!)... but the moment a 
non-ascii character gets introduced you'll suddenly get an exception, maybe in 
some library code you rely on but can't fix. For this reason I don't think 
treating this as a programming bug is helping anyone write more robust Python 
code - quite the reverse. Plus, I think the no-op behaviour is almost always 
'what you would have wanted it to do' anyway, whereas the exception-throwing 
behaviour almost never is. 
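
To make the inconsistency concrete, here's what I see in a stock 2.7 
interpreter (default encoding 'ascii'):

    >>> 'hello'.encode('utf-8')        # all-ascii str: silently 'works' (a no-op)
    'hello'
    >>> 'caf\xc3\xa9'.encode('utf-8')  # the same call with non-ascii bytes
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    >>> 'caf\xc3\xa9'.encode('utf-8', 'ignore')  # even asking to ignore errors doesn't help
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)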

I think we'd agree that changing str.encode('utf-8') to throw an exception in 
*all* cases wouldn't be a realistic option, since it would certainly break 
backwards compatibility in painful ways for many existing apps and libraries. 

So, if we want to make the behaviour of this important built-in type a bit more 
consistent and less error-prone/fragile here, then I think the only option is 
to make str.encode a no-op for non-ascii characters too (at least, for 
non-ascii characters that are valid in the specified encoding), just as it 
already is for ascii characters; I've sketched what I mean below. 
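
Roughly, in pure Python (the function name is mine, purely illustrative - a 
sketch of the proposed semantics, not an implementation):

    def proposed_str_encode(s, encoding, errors='strict'):
        # make the implicit decode honour the *same* encoding and errors
        # arguments that were passed to encode(), instead of the global
        # default; for bytes already valid in that encoding this
        # round-trips to a no-op, matching today's ascii-only behaviour
        if isinstance(s, str):
            s = s.decode(encoding, errors)
        return s.encode(encoding, errors)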

Here's why I think ditching the current behaviour would be a good idea:
- calling str.encode() and getting a DecodeError is confusing ("I asked you to 
encode this string, what are you decoding for?")
- calling str.encode('utf-8') and getting an exception about "ascii" is 
confusing as the only encoding I mentioned in the method call was utf-8
- calling encode(..., errors=ignore) and getting an exception is confusing and 
feels like a bug (see the session above): I've explicitly specified that I do 
NOT want exceptions from this call, yet - because neither the 'errors' nor the 
'encoding' argument gets passed to the implicit (and undocumented) decode 
operation - I get unexpected behaviour that is far more likely to break my 
program than a no-op
- the somewhat surprising behaviour we're talking about is not explicitly 
documented anywhere
- having str.encode throw on non-ascii but not ascii makes it very likely that 
code will be written and shipped (including library code you may have no 
control over) that *appears* to work under normal testing but has *hidden* bugs 
that surface only once non-ascii characters are used. 
- in every situation I can think of, having str.encode(encoding, errors=ignore) 
honour the encoding and errors arguments even for the implicit-decode operation 
is more useful than having it ignore those arguments and throw an exception
- a quick google shows lots of people in the Python community (from newbies to 
experts) are seeing this exception and being confused by it, therefore a lot of 
people's lives might be improved if we can somehow make the situation better :)
- even with the best of intentions (and with code written by senior Python 
programmers who understand unicode issues well), it's very difficult in 
practice to write non-trivial Python programs that consistently use the 
'unicode' string type throughout (especially when legacy code or third-party 
libraries are involved), so most 'real' code needs to cope with a mix of str 
and unicode types. So when you need to write a 'basestring' out to a file, 
you'd like to be able to simply call the s.encode(myencoding, errors=whatever) 
method that exists on both str and unicode types and have it 'work', whether 
it's a str already in that encoding or a unicode object that needs to be 
converted. This is a common use case, and the behaviour I'm suggesting would 
really help with it. The alternative is that every Python programmer who cares 
about non-ascii characters has to write an unpleasant and un-pythonic if 
clause that branches on the _type_, in every place they need a byte str (the 
inevitable wrapper helper is sketched just after this):
        if isinstance(s, unicode):
            # nb: encode() rejects keyword arguments in 2.x,
            # so 'ignore' must be passed positionally
            f.write(s.encode('utf-8', 'ignore'))
        else:
            f.write(s)
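
In practice, of course, everyone ends up wrapping that clause in a little 
helper - something like the following, where to_bytes() is a hypothetical name 
of my own, just to illustrate the boilerplate the proposed behaviour would 
make unnecessary:

    def to_bytes(s, encoding='utf-8', errors='ignore'):
        # hypothetical helper: normalise str/unicode to a byte str
        if isinstance(s, unicode):
            return s.encode(encoding, errors)  # errors passed positionally (2.x)
        return s  # assume it's already a byte str in the right encoding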

nb: although I've used str.encode in the examples above, unicode.decode has the 
exact same issue (and the same potential solution), as shown below - and of 
course none of this is specific to utf-8; it applies to all codecs that convert 
between str and unicode (i.e. most of them, base64 and friends excepted). 
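
The mirror-image failure, for completeness (again in a stock 2.7 interpreter):

    >>> u'caf\xe9'.decode('utf-8')     # a 'decode' that dies trying to *encode*
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)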

I hope you'll consider this proposal - it's probably not a very big change, 
it's very unlikely to break any existing/working code, and it has the potential 
to reduce fragility and hard-to-diagnose bugs in an area of Python that seems 
to cause pain and confusion to lots of people.

Thanks for considering!

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________