Ben Spiller added the comment:
btw, if anyone can find the place in the code (sorry, I tried and failed!) where
str.encode('utf-8', errors=X) results in an implicit call to the equivalent of
decode(defaultencoding, errors='strict') (as suggested by the exception
message), I think it'll be easier to discuss the details of fixing it.
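For what it's worth, the mechanism can be sketched in a few lines. Python 2's str.encode on a byte string first decodes it with the default encoding ('ascii') and only then encodes, which is why the traceback mentions a *decode* error and "ascii" even though you asked for utf-8. Here's a hedged emulation (written as Python 3 code so it actually runs; the helper name py2_style_encode is mine, not anything in CPython):

```python
def py2_style_encode(raw, encoding, errors='strict'):
    """Emulate Python 2's str.encode(): the byte string is first
    implicitly decoded with the default encoding ('ascii'), and the
    caller's 'errors' argument is NOT passed to that decode step."""
    return raw.decode('ascii').encode(encoding, errors)

# ASCII bytes silently "work" (effectively a no-op)...
assert py2_style_encode(b'hello', 'utf-8') == b'hello'

# ...but non-ASCII bytes raise UnicodeDecodeError, even with
# errors='ignore', because 'ignore' only reaches the encode step,
# never the hidden decode.
try:
    py2_style_encode(b'caf\xc3\xa9', 'utf-8', errors='ignore')
except UnicodeDecodeError as e:
    print('decode error from the', e.encoding, 'codec')  # the codec is 'ascii'
```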
Thanks for your reply - yes, I'm aware that in theory you _could_ globally
change Python's default encoding from ascii, but the prevailing view I've heard
from Python developers is that changing it is not a good idea and may break
lots of library code. It's also probably not a good idea for individual
libraries or modules to change global state that affects the entire Python
invocation, and it would be nice to find a less fragile, more out-of-the-box
solution to this. You may well want different encodings (not just utf-8) in
different parts of your program - so changing the globally-defined default
encoding doesn't seem right, especially for a method like str.encode that
already takes an 'encoding' argument (currently used only for the encoding
step, not the implicit decoding step).
I do think there's a strong case for changing the str.encode (and also
unicode.decode) behaviour so that str.encode('utf-8') behaves the same whether
it's given ascii or non-ascii characters, and also similarly to
unicode.encode('utf-8'). Let me try to persuade you... :)
First, to address the point you made:
> If str.encode() raises a decoding exception, this is a programming bug. It
> would be bad to hide it.
I totally agree with the general principle of not hiding programming bugs.
However, if calling str.encode for codecs like utf8 (let's ignore base64 for
now, which is a very different beast) were *consistently* treated as a
'programming bug' by Python and always resulted in an exception, that would be
ok (suboptimal usability imho, but still ok), since programmers would quickly
spot the problem and fix it. But that's not what happens - it *silently works*
(is a no-op) as long as you happen to be using ASCII characters, so this
so-called 'programming bug' will go unnoticed by most programmers (and by
authors of third-party library code you might be relying on!)... but the moment
a non-ascii character gets introduced you'll suddenly get an exception, maybe
in some library code you rely on but can't fix. For this reason I don't think
treating this as a programming bug is helping anyone write more robust Python
code - quite the reverse. Plus I think the no-op behaviour is almost always
'what you would have wanted it to do' anyway, whereas the behaviour of
throwing an exception almost never is.
I think we'd agree that changing str.encode(utf8) to throw an exception in
*all* cases wouldn't be a realistic option, since it would certainly break
backwards compatibility in painful ways for many existing apps and library
code.
So, if we want to make the behaviour of this important built-in type a bit more
consistent and less error-prone/fragile for this case then I think the only
option is making str.encode be a no-op for non-ascii characters (at least,
non-ascii characters that are valid in the specified encoding), just as it is
for ascii characters.
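To make the proposal concrete, here is a minimal sketch of the behaviour I'm suggesting, again written as runnable Python 3 for illustration (the function name proposed_encode is mine; this is not an existing API):

```python
def proposed_encode(raw, encoding, errors='strict'):
    """Sketch of the proposal: pass the caller's 'encoding' and
    'errors' arguments to the implicit decode step too, so byte
    strings already valid in the target encoding round-trip
    unchanged - just as pure-ASCII byte strings do today."""
    return raw.decode(encoding, errors).encode(encoding, errors)

# Valid utf-8 bytes (ascii or not) come back unchanged - a no-op:
assert proposed_encode(b'caf\xc3\xa9', 'utf-8') == b'caf\xc3\xa9'

# And errors='ignore' is actually honoured for invalid bytes,
# instead of raising from a hidden ascii decode:
assert proposed_encode(b'ok\xff', 'utf-8', errors='ignore') == b'ok'
```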
Here's why I think ditching the current behaviour would be a good idea:
- calling str.encode() and getting a DecodeError is confusing ("I asked you to
encode this string, what are you decoding for?")
- calling str.encode('utf-8') and getting an exception about "ascii" is
confusing as the only encoding I mentioned in the method call was utf-8
- calling encode(..., errors=ignore) and getting an exception is confusing and
feels like a bug; I've explicitly specified that I do NOT want exceptions from
calling this method, yet (because neither the 'errors' nor the 'encoding'
argument gets passed to the implicit - and undocumented - decode operation) I
get unexpected behaviour that is far more likely to break my program than a
no-op
- the somewhat surprising behaviour we're talking about is not explicitly
documented anywhere
- having str.encode throw on non-ascii but not ascii makes it very likely that
code will be written and shipped (including library code you may have no
control over) that *appears* to work under normal testing but has *hidden* bugs
that surface only once non-ascii characters are used.
- in every situation I can think of, having str.encode(encoding, errors=ignore)
honour the encoding and errors arguments even for the implicit-decode operation
is more useful than having it ignore those arguments and throw an exception
- a quick google shows lots of people in the Python community (from newbies to
experts) are seeing this exception and being confused by it, therefore a lot of
people's lives might be improved if we can somehow make the situation better :)
- even with the best of intentions (and with code written by senior python
programmers who understand unicode issues well) it's very difficult in practice
to write non-trivial python programs that always consistently use the 'unicode'
string type throughout (especially when legacy code or third party libraries
are involved), so most 'real' code needs to cope with a mix of str and unicode
types in practice. So when you need to write a 'basestring' out to a file,
you'd like to be able to simply call the s.encode(myencoding, errors=whatever)
method that exists on both str and unicode types and have it 'work' whether
it's a str already in that encoding or a unicode object that needs to be
converted. This is a common use case and the behaviour I'm suggesting would
really help with this case. The alternative is that every python programmer who
cares about non-ascii characters has to write an unpleasant and un-pythonic if
clause to give different behaviour based on __type__, in every place they need
a byte str:
    if isinstance(s, unicode):
        f.write(s.encode('utf-8', errors='ignore'))
    else:
        f.write(s)
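At best that branching can be hidden in a small helper; a minimal sketch of the workaround (the name to_bytes is mine; shown with Python 3's bytes/str types, which play the same roles here that Python 2's str/unicode do):

```python
def to_bytes(s, encoding='utf-8', errors='strict'):
    """Return s as a byte string: a no-op if it is already bytes,
    otherwise encode the text with the given encoding and errors."""
    if isinstance(s, bytes):
        return s  # already encoded; trust the caller's encoding
    return s.encode(encoding, errors)

assert to_bytes(b'caf\xc3\xa9') == b'caf\xc3\xa9'  # bytes pass through
assert to_bytes('caf\xe9') == b'caf\xc3\xa9'       # text gets encoded
```

This is exactly the no-op-for-bytes behaviour I'm arguing str.encode itself should have, just bolted on from the outside.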
nb: Although I've used the example of str.encode above, unicode.decode has the
exact same issues (and potential solution), and of course this isn't specific
to utf-8 but to all codecs that convert between str and unicode (i.e. most of
them except base64).
I hope you'll consider this proposal - it's probably not a very big change, is
very unlikely to break any existing/working code, and has the potential to help
reduce fragility and difficult-to-resolve bugs in an area of Python that seems
to cause pain and confusion to lots of people.
Thanks for considering!
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26369>
_______________________________________