Re: Unicode again ... default codec ...

Gabriel Genellina Fri, 30 Oct 2009 15:34:02 -0700

On Fri, 30 Oct 2009 13:40:14 -0300, zooko <[email protected]> wrote:

On Oct 20, 9:50 pm, "Gabriel Genellina" <[email protected]>
wrote:

DON'T do that. Really. Changing the default encoding is a horrible,
horrible hack and causes a lot of problems.


I'm not convinced.  I've read all of the posts and web pages and blog
entries decrying this practice over the last several years, but as far
as I can tell the actual harm that can result is limited (as long as
you set it to utf-8) and the practical benefits are substantial.  This
is a pattern that I have no problem using:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

The reason this doesn't cause too much harm is that anything that
would have worked with the original default encoding ('ascii') will
also work with the new utf-8 default encoding.

Wrong. Dictionaries may start behaving incorrectly, for example. Normally, two keys that compare equal cannot coexist in the same dictionary:


py> 1 == 1.0
True
py> d = {}
py> d[1] = '*'
py> d[1.0]
'*'
py> d[1.0] = '$'
py> d
{1: '$'}

1 and 1.0 are the same key, as far as the dictionary is concerned. For this to work, both keys must have the same hash:


py> hash(1) == hash(1.0)
True

Now, let's set the default encoding to utf-8:

py> import sys
py> reload(sys)
<module 'sys' (built-in)>
py> sys.setdefaultencoding('utf-8')
py> x = u'á'
py> y = u'á'.encode('utf-8')
py> x
u'\xe1'
py> y
'\xc3\xa1'

(same as y='á' if the source encoding is set to utf-8, but I don't want to depend on that). Just to be sure we're dealing with the right character:


py> import unicodedata
py> unicodedata.name(x)
'LATIN SMALL LETTER A WITH ACUTE'
py> unicodedata.name(y.decode('utf-8'))
'LATIN SMALL LETTER A WITH ACUTE'

Now, we can see that both x and y are equal:

py> x == y
True

x is an accented a, y is the same thing encoded using the default encoding, and both are equal. Fine. Now create a dictionary:


py> d = {}
py> d[x] = '*'
py> d[x]
'*'
py> x in d
True
py> y in d
False            # ???
py> d[y] = 2
py> d
{u'\xe1': '*', '\xc3\xa1': 2} # ????

Since x==y, one should expect a single entry in the dictionary - but we got two. That's because:


py> x == y
True
py> hash(x) == hash(y)
False

and this must *not* happen according to
http://docs.python.org/reference/datamodel.html#object.__hash__

"The only required property is that objects which compare equal have the same hash value"

Considering that dictionaries in Python are used almost everywhere, breaking this basic assumption is a really bad problem.
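
The same failure can be reproduced without touching the default encoding at all. The following is only a sketch: BrokenKey is a hypothetical class invented for this illustration (it is not part of Python or of any library), and its hash is an explicit integer so that the result does not depend on string hashing.

```python
class BrokenKey(object):
    """Hypothetical key type whose __eq__ and __hash__ disagree on purpose."""

    def __init__(self, text, bucket):
        self.text = text
        self.bucket = bucket  # explicit integer used as the hash (assumption)

    def __eq__(self, other):
        # Equality looks only at the text...
        return isinstance(other, BrokenKey) and self.text == other.text

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # ...but the hash ignores the text, which breaks the documented rule.
        return self.bucket


x = BrokenKey(u'\xe1', 1)
y = BrokenKey(u'\xe1', 2)

d = {}
d[x] = '*'
found_before_insert = y in d   # lookup is by hash first, so y is not found
d[y] = 2                       # stored as a second, separate entry
```

Here x == y is true, yet the dictionary ends up holding two entries, just like the unicode/str pair above.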

Of course, all of this applies to Python 2.x; in Python 3.0 the problem was solved differently; strings are unicode by default, and the default encoding IS utf-8.
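
For comparison, here is a small sketch of the Python 3 behaviour. It assumes a Python 3 interpreter started without the -b option (which would turn the str/bytes comparison into a warning or an error); the variable names are just for illustration.

```python
import sys

x = '\xe1'                # text: LATIN SMALL LETTER A WITH ACUTE
y = x.encode('utf-8')     # bytes: b'\xc3\xa1'

default_encoding = sys.getdefaultencoding()

# Text and bytes are never implicitly converted, so they never compare equal.
equal = (x == y)

# Unequal keys stored as separate entries: equality and hashing agree here.
d = {x: '*', y: 2}

# Getting back to text always takes an explicit decode.
round_trip = y.decode('utf-8')
```

In this case the two entries in the dictionary are consistent with x != y, so the rule quoted above is not broken.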

As far as I've seen
from the aforementioned mailing list threads and blog posts and so on,
the worst thing that has ever happened as a result of this technique
is that something works for you but fails for someone else who doesn't
have this stanza.
(http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/ .)
That's bad, but probably just
including this stanza at the top of the file that you are sharing with
that other person instead of doing it in a sitecustomize.py file will
avoid that problem.

And then you break all other libraries that the program is using, including the Python standard library, because the default encoding is a global setting. What if another library decides to use latin-1 as the default encoding, using the same trick? The last one wins...

You said "the practical benefits are substantial" but I, for one, cannot see any benefit. Perhaps if you post your real problems, someone can find the solution.

The right way is to fix your program to do the right thing, not to sweep the bugs under the rug.
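
Without seeing the real problem, I can only guess what "the right thing" looks like in your program. One common approach is to decode bytes explicitly at the point where they enter the program and to use only unicode text as dictionary keys. A minimal sketch, assuming the input is UTF-8; to_text is a hypothetical helper written for this example, not a standard function:

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        # The encoding is named here, at the boundary, not set globally.
        return value.decode(encoding)
    return value


raw = b'\xc3\xa1'          # bytes as read from a file or socket (assumed UTF-8)

d = {}
d[to_text(u'\xe1')] = '*'
d[to_text(raw)] = 2        # same key after decoding, so the value is replaced
```

With this, the dictionary holds a single entry, and nothing outside your own code is affected.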


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list
