On Tue, 22 Mar 2005 20:09:55 -0600, "John Roth" <[EMAIL PROTECTED]> wrote:
>I had this problem recently. It turned out that something
>had encoded a unicode string into utf-8. When I found
>the culprit and fixed the underlying design issue, it went away.
>
>John Roth
>
>"jdonnell" <[EMAIL PROTECTED]> wrote in message
>news:[EMAIL PROTECTED]
>I have a mysql database with characters like Â » in it. I'm
>trying to write a python script to remove these, but I'm having a
>really hard time.
>
>These strings are coming out as type 'str' not 'unicode' so I tried to
>just
>
>record[4].replace('Â', '')
>
>but this does nothing. However the following code works
>
>#!/usr/bin/python
>
>s = 'aaaaa Â aaa'
>print type(s)
>print s
>print s.find('Â')
>
>This returns
><type 'str'>
>aaaaa Â aaa
>6
>
>The other odd thing is that the Â character shows up as two spaces if
>I print it to the terminal from mysql, but it shows up as Â when I
>print from the simple script above.
>What am I doing wrong?
>
What encodings are involved?
This is from idle on windows, which seems to display latin-1 source ok:
----
 >>> "Latin-1:Â»\n".decode('latin-1')
 u'Latin-1:\xc2\xbb\n'
 >>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'replace')
 'Latin-1:?\xaf\n'
 >>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'ignore')
 'Latin-1:\xaf\n'
 >>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
 'Latin-1:?\xaf\n'
 >>>
----
Now this is in an NT4 console window with code page 437:
----
 >>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
 'Latin-1:?\xaf\n'
 >>> import sys
 >>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','replace'))
 Latin-1:?»
----
Notice that the interactive output does a repr that creates the \xaf, but
the character is available and can be written non-repr'd via sys.stdout.write.

For the heck of it:

 >>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','xmlcharrefreplace'))
 Latin-1:&#194;»

I don't know if this is going to get through to your screen ;-)

Regards,
Bengt Richter
--
http://mail.python.org/mailman/listinfo/python-list
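
For anyone reading this thread on Python 3, here is a minimal sketch of what the sessions above seem to point at. It assumes the database text is UTF-8 that got read back as Latin-1, which the thread suggests but does not confirm. The name repair_mojibake is a hypothetical helper, not something from the thread or from any library. The sketch also shows that replace returns a new string instead of changing the old one, which may be one reason the record[4].replace(...) call appeared to do nothing.

```python
# Python 3 sketch. The sessions in the thread are Python 2, where 'str' holds bytes.

RAQUO = "\u00bb"  # the character that was presumably meant to be stored

# UTF-8 encodes U+00BB as the two bytes 0xC2 0xBB.
utf8_bytes = RAQUO.encode("utf-8")

# Reading those two bytes as Latin-1 yields two characters:
# U+00C2 (the stray A-circumflex) followed by U+00BB.
mojibake = utf8_bytes.decode("latin-1")


def repair_mojibake(text):
    """Hypothetical helper: undo a UTF-8-read-as-Latin-1 mix-up.

    Raises UnicodeError if the text was not damaged in exactly this way.
    """
    return text.encode("latin-1").decode("utf-8")


# str.replace returns a new string, so the result has to be kept.
stripped = mojibake.replace("\u00c2", "")

# Code page 437 contains U+00BB (byte 0xAF) but not U+00C2, so the
# error handler decides what happens to the stray character.
cp437_results = {
    handler: ("Latin-1:" + mojibake).encode("cp437", handler)
    for handler in ("replace", "ignore", "xmlcharrefreplace")
}

print(utf8_bytes)
print(ascii(mojibake))
print(ascii(repair_mojibake(mojibake)))
print(ascii(stripped))
for handler, encoded in cp437_results.items():
    print(handler, encoded)
```

Stripping the stray character with replace only hides the symptom. If the assumption about the encodings holds, decoding the bytes with the right codec when they are read (or a round trip like the one in repair_mojibake) recovers the intended text, which is closer to the design fix John Roth describes.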