In <[EMAIL PROTECTED]>, jdonnell wrote:

> Thanks for all the replies. I just got in to work so I haven't tried
> any of them yet. I see that I wasn't as clear as I should have been so
> I'll clarify a little. I'm grabbing some data from msn's rss feed.
> Here's an example.
> http://search.msn.com/results.aspx?q=domain+name&format=rss&FORM=ZZRE

Then you are getting UTF-8 encoded strings.

> The string ' all domain name extensions     » Good' is where I have a
> problem. The '    »' shows up as 'Â Â Â»' when I write it to a file or
> stick it in mysql. I did a hex dump and this is what I see.
>
> [EMAIL PROTECTED]:~/scripts> cat test.txt
> extensions     » Good
> [EMAIL PROTECTED]:~/scripts> xxd test.txt
> 0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0  extensions .. ..
> 0000010: 20c2 bb20 476f 6f64 0a                    .. Good
>
> One thing that jumps out is that two of the Â's are c2a0, but one of
> them is c2bb. Well, those are the details since I wasn't clear before.

Those are two no-break spaces and a '»' character::

    In [42]: import unicodedata

    In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
    Out[43]: 'NO-BREAK SPACE'

    In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
    Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'

Ciao,
	Marc 'BlackJack' Rintsch
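
For readers on Python 3, where byte strings and text are separate types, the same check can be sketched roughly as below. The byte string is copied from the hex dump in the quoted message. That the garbling comes from something reading the UTF-8 bytes as Latin-1 is an assumption, not something the thread confirms, and the variable names are only illustrative.

```python
import unicodedata

# Bytes copied from the xxd output in the quoted message.
raw = b"extensions \xc2\xa0 \xc2\xa0 \xc2\xbb Good\n"

# Decoding as UTF-8 recovers the intended text.
text = raw.decode("utf-8")

# Name every non-ASCII character to see what the feed really contains.
names = [unicodedata.name(ch) for ch in text if ord(ch) > 127]

# Assumption: the garbling comes from reading the same bytes as Latin-1.
# Each 0xC2 lead byte then turns into a separate U+00C2 character.
garbled = raw.decode("latin-1")

print(names)
print(repr(garbled))
```

If that assumption holds, the likely fix is to decode the feed as UTF-8 once and make sure the file and the MySQL connection are also set to UTF-8, so the bytes are never reinterpreted in another charset.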