In <[EMAIL PROTECTED]>, jdonnell wrote:

> Thanks for all the replies. I just got in to work so I haven't tried
> any of them yet. I see that I wasn't as clear as I should have been so
> I'll clarify a little. I'm grabbing some data from msn's rss feed.
> Here's an example.
> http://search.msn.com/results.aspx?q=domain+name&format=rss&FORM=ZZRE

Then you are getting UTF-8 encoded strings.

> The string ' all domain name extensions     » Good' is where I have a
> problem. The
> '    »' shows up as 'Â  Â  Â»' when I write it to a file or stick
> it in mysql. I did a hex dump and this is what I see.
> 
> [EMAIL PROTECTED]:~/scripts> cat test.txt
> extensions     » Good
> [EMAIL PROTECTED]:~/scripts> xxd test.txt
> 0000000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0  extensions .. ..
> 0000010: 20c2 bb20 476f 6f64 0a                    .. Good
> 
> One thing that jumps out is that two of the Â's are c2a0, but one of
> them is c2bb. Well, those are the details since I wasn't clear before.

Those are two no-break spaces and a '»' character::

  In [42]: import unicodedata

  In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
  Out[43]: 'NO-BREAK SPACE'

  In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
  Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'
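
The 'Â' characters most likely come from the same bytes being read as
Latin-1 instead of UTF-8. Here is a minimal sketch of that, written in
Python 3 syntax (the session above is Python 2). The byte string is
copied from the xxd dump, starting after "extensions"; the variable
names are only illustrative, and Latin-1 is my assumption about what
the file viewer or the MySQL connection is using.

```python
import unicodedata

# Bytes from the xxd dump above, starting right after "extensions".
raw = bytes.fromhex("20c2a020c2a020c2bb20476f6f640a")

# Decoded as UTF-8, each two-byte sequence becomes one character.
text = raw.decode("utf-8")
names = [unicodedata.name(ch) for ch in text if ord(ch) > 127]

# Decoded as Latin-1 (assumed wrong encoding), every 0xC2 lead byte
# turns into its own 'Â' character, which is the garbage described.
mojibake = raw.decode("latin-1")

print(names)
print(repr(mojibake))
```

If that assumption holds, the fix is to decode the feed as UTF-8 once
and then pick the output encoding explicitly when writing to the file
or the database.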

Ciao,
        Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list
