I have long found Python's default encoding of strict ASCII frustrating. For one thing, I would rather get a garbage character than an exception. But the biggest issue is that Unicode exceptions often pop up in unexpected places, and only when a non-ASCII or Unicode character first finds its way into the system.

Below is an example. The program may run fine at first. But as soon as a Unicode string like u'b' is introduced, the program blows up unexpectedly.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> a = '\xe5'
>>> # can print, you think you're ok
... print a
å
>>> b = u'b'
>>> a == b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)



One may suggest that the correct way to do this is to decode explicitly, such as

  a.decode('latin-1') == b
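
For illustration, here is how that plays out if we continue the same session, assuming the byte string really holds Latin-1 data:

  >>> a.decode('latin-1') == b
  False
  >>> a.decode('latin-1') == u'\xe5'
  True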


This brings up another issue. Most references and books focus exclusively on entering Unicode literals and using the encode/decode methods. The fallacy is that strings are such a basic data type, used throughout a program, that you really don't want to make an individual decision every time you use a string (and take a penalty for any negligence). Java has a much more usable model, with Unicode used internally and the encoding/decoding decision needed only twice: once on input and once on output.
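
That discipline can at least be approximated in Python with the standard codecs module, which confines the encoding decision to the two I/O boundaries. A minimal sketch follows; the file names and the Latin-1/UTF-8 encodings are only illustrative assumptions:

  import codecs

  # Decode once, at input: each line read is a unicode object.
  infile = codecs.open('input.txt', 'r', 'latin-1')
  lines = [line.upper() for line in infile]   # all internal work is on unicode
  infile.close()

  # Encode once, at output: the writer converts unicode back to bytes.
  outfile = codecs.open('output.txt', 'w', 'utf-8')
  outfile.writelines(lines)
  outfile.close()

Everything between the two boundaries then deals only in unicode objects, so no implicit ASCII decoding can be triggered in the middle of the program.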


I am sure these errors are a nuisance to those who are only half conscious of Unicode. Even for those who choose to use Unicode, it is almost impossible to ensure their programs work correctly.