Below is an example. The program may run fine at the beginning, but as soon as a unicode string u'b' is introduced, it blows up unexpectedly.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> a='\xe5'   # can print, you think you're ok
>>> print a
å
>>> b=u'b'
>>> a==b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
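What happens under the hood is that comparing a byte string with a unicode string makes Python try to decode the byte string with the default (ascii) codec before comparing. The exact coercion path depends on the interpreter version, so take this as a sketch of the same failure made explicit:

>>> unicode(a)   # roughly what a==b attempts implicitly
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)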
One may suggest that the correct way to do it is to use decode, for example:
>>> a.decode('latin-1') == b
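Assuming the bytes really are latin-1, this at least stops the exception; both sides are now unicode, so the comparison simply evaluates:

>>> a.decode('latin-1') == b
False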
This brings up another issue. Most references and books focus exclusively on entering unicode literals and using the encode/decode methods. The fallacy is that strings are such a basic data type, used throughout a program, that you really don't want to make an individual decision every time you use one (and take a penalty for any negligence). Java has a much more usable model: unicode is used internally, and encoding/decoding decisions are needed only twice, when dealing with input and output.
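In that model a program might look like the sketch below: decode once at the input boundary, work in unicode everywhere in between, and encode once at the output boundary. The filenames and encodings here are invented for illustration; codecs.open handles the decoding and encoding at the boundaries.

import codecs

# Decode exactly once, at the input boundary
# (assumes the input file happens to be latin-1).
fin = codecs.open('input.txt', 'r', encoding='latin-1')
text = fin.read()        # text is a unicode object from here on
fin.close()

# All internal processing stays in unicode; no per-use decisions.
text = text.upper()

# Encode exactly once, at the output boundary.
fout = codecs.open('output.txt', 'w', encoding='utf-8')
fout.write(text)         # unicode is encoded on the way out
fout.close()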
I am sure these errors are a nuisance to those who are only half conscious of unicode. Even for those who choose to use unicode, it is almost impossible to ensure their programs work correctly.