On Fri, 30 Oct 2009 13:40:14 -0300, zooko <zoo...@gmail.com> wrote:
> On Oct 20, 9:50 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
> wrote:
>
>> DON'T do that. Really. Changing the default encoding is a horrible,
>> horrible hack and causes a lot of problems.
>
> I'm not convinced.  I've read all of the posts and web pages and blog
> entries decrying this practice over the last several years, but as far
> as I can tell the actual harm that can result is limited (as long as
> you set it to utf-8) and the practical benefits are substantial.  This
> is a pattern that I have no problem using:
>
> import sys
> reload(sys)
> sys.setdefaultencoding("utf-8")
>
> The reason this doesn't cause too much harm is that anything that
> would have worked with the original default encoding ('ascii') will
> also work with the new utf-8 default encoding.

Wrong. Dictionaries may start behaving incorrectly, for example. Normally, two keys that compare equal cannot coexist in the same dictionary:

py> 1 == 1.0
True
py> d = {}
py> d[1] = '*'
py> d[1.0]
'*'
py> d[1.0] = '$'
py> d
{1: '$'}

1 and 1.0 are the same key, as far as the dictionary is concerned. For this to work, both keys must have the same hash:

py> hash(1) == hash(1.0)
True

Now, let's set the default encoding to utf-8:

py> import sys
py> reload(sys)
<module 'sys' (built-in)>
py> sys.setdefaultencoding('utf-8')
py> x = u'á'
py> y = u'á'.encode('utf-8')
py> x
u'\xe1'
py> y
'\xc3\xa1'

(This is the same as y='á' when the source encoding is set to utf-8, but I don't want to depend on that.) Just to be sure we're dealing with the right character:

py> import unicodedata
py> unicodedata.name(x)
'LATIN SMALL LETTER A WITH ACUTE'
py> unicodedata.name(y.decode('utf-8'))
'LATIN SMALL LETTER A WITH ACUTE'

Now, we can see that both x and y are equal:

py> x == y
True

x is an accented a, y is the same character encoded using the (new) default encoding, and the two compare equal. Fine. Now create a dictionary:

py> d = {}
py> d[x] = '*'
py> d[x]
'*'
py> x in d
True
py> y in d
False            # ???
py> d[y] = 2
py> d
{u'\xe1': '*', '\xc3\xa1': 2} # ????

Since x==y, one should expect a single entry in the dictionary - but we got two. That's because:

py> x == y
True
py> hash(x) == hash(y)
False

and this must *not* happen, according to http://docs.python.org/reference/datamodel.html#object.__hash__: "The only required property is that objects which compare equal have the same hash value". Since dictionaries are used almost everywhere in Python, breaking this basic assumption is a really serious problem.
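You can reproduce the same kind of failure without touching the default encoding at all. Here is a minimal sketch, assuming a made-up class BrokenKey (hypothetical, not from any library) whose __eq__ says two objects are equal while __hash__ disagrees:

```python
class BrokenKey(object):
    """Hypothetical key type: equality ignores case, the hash does not."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return (isinstance(other, BrokenKey)
                and self.text.lower() == other.text.lower())

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Deliberate bug: the hash depends on the exact characters,
        # so "spam" and "SPAM" compare equal but hash differently.
        return sum(ord(c) for c in self.text)


a = BrokenKey("spam")
b = BrokenKey("SPAM")

d = {a: "*"}
d[b] = "$"          # should overwrite the entry for a, but it does not

equal = (a == b)
same_hash = (hash(a) == hash(b))
entries = len(d)
found = (b in {a: "*"})
```

Here equal is true and same_hash is false, and the dictionary ends up with two entries for keys that claim to be equal. That is the same situation as x and y above.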

Of course, all of this applies to Python 2.x. Python 3.0 solved the problem differently: strings are unicode by default, and the default encoding IS utf-8.
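For comparison, here is a short sketch of the same experiment under Python 3 (it assumes a Python 3 interpreter and will not give these results on 2.x). Text and bytes never compare equal there, so getting two dictionary entries is consistent with the hash rule:

```python
import sys

x = "\xe1"                # LATIN SMALL LETTER A WITH ACUTE, as text (str)
y = x.encode("utf-8")     # the same character as UTF-8 bytes

default_encoding = sys.getdefaultencoding()
are_equal = (x == y)      # str and bytes are different types; never equal

d = {x: "*"}
d[y] = 2                  # a second, distinct key; no invariant is broken

round_trip = (y.decode("utf-8") == x)   # explicit decoding still works
```

Since x != y, nothing requires their hashes to match, and the two keys can coexist legitimately.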

> As far as I've seen
> from the aforementioned mailing list threads and blog posts and so on,
> the worst thing that has ever happened as a result of this technique
> is that something works for you but fails for someone else who doesn't
> have this stanza.
> (http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/ .)
> That's bad, but probably just
> including this stanza at the top of the file that you are sharing with
> that other person instead of doing it in a sitecustomize.py file will
> avoid that problem.

And then you break all the other libraries that the program is using, including the Python standard library, because the default encoding is a global setting. What if another library decides to use latin-1 as the default encoding, using the same trick? The last one wins...

You said "the practical benefits are substantial" but I, for one, cannot see any benefit. Perhaps if you post your real problems, someone can find the solution. The right way is to fix your program to do the right thing, not to sweep the bugs under the rug.
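To illustrate what I mean by fixing the program, here is a minimal sketch of the usual approach: decode bytes into unicode as soon as they enter the program, work only with unicode inside, and encode explicitly on the way out. The helper names to_text and to_bytes are hypothetical, and I'm assuming the incoming data is utf-8; your real data source decides that.

```python
import io


def to_text(data, encoding="utf-8"):
    """Hypothetical helper: decode incoming bytes; pass text through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def to_bytes(text, encoding="utf-8"):
    """Hypothetical helper: encode text explicitly when it leaves the program."""
    return text.encode(encoding)


raw = b"\xc3\xa1"             # bytes as they might arrive from a file or socket
key = to_text(raw)            # decoded once, at the boundary

d = {}
d[key] = "*"
d[to_text(u"\xe1")] = "$"     # same character, already text: same key

out = io.BytesIO()            # stands in for a file or socket opened in binary mode
out.write(to_bytes(key))      # encoded once, on the way out
written = out.getvalue()
```

With only unicode keys inside the program, the dictionary has a single entry, and no global setting needs to change.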

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list
