On Tue, 2 Feb 2010 22:56:22 +0100
Norman Khine <nor...@khine.net> wrote:

> i am no expert, but there seems to be a bigger difference.
> 
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
> 
> where as you get
> 
> Sat\xc3\xa9re Maw\xc3\xa9
> 
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

This is a rather complicated issue that mixes Python str objects, unicode 
strings, and their repr().
Kent is right that the *Python str* "\xc3\xa9" is the UTF-8 encoding of 'é' 
(2 bytes), whereas \xe9 is the *unicode code point* of 'é' (U+00E9), which 
should only appear in a unicode string.
So:
   unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
   u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
   unicode("\xc3\xa9", "utf8") == u"\u00e9"     -- decoding
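
For anyone reading this on Python 3, here is a rough sketch of the same round 
trip. It assumes Python 3, where str is already the unicode string and bytes 
plays the role of the old byte str; the variable names are only illustrative.

```python
# Python 3 sketch: str holds code points, bytes holds encoded data.
text = "\u00e9"                     # 'é' as one code point, U+00E9
encoded = text.encode("utf-8")      # the two bytes 0xC3 0xA9
decoded = encoded.decode("utf-8")   # back to the one-character string

print(len(text), len(encoded))      # 1 2
print(repr(encoded))                # b'\xc3\xa9'
print(decoded == text)              # True
```

The lengths make the difference visible: one character, two bytes.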

The question is: what do you want to do with the result? You'll need either the 
UTF-8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for 
processing). What you actually get is a mix of the two: the (Python str) repr 
of a unicode string, in which the escape \xe9 is spelled out as literal 
characters.
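
A possible way to see that mix, and to get back from it, is sketched below. 
It assumes Python 3 and that the doubled backslashes in your output come from 
an escaped text form like the one the "unicode_escape" codec produces; the 
variable names are made up for the illustration.

```python
# Python 3 sketch: reproduce the escaped form, then recover both useful forms.
name = "Sat\u00e9re Maw\u00e9"

# Escaped text: the backslash, 'x', 'e', '9' are now literal characters.
escaped = name.encode("unicode_escape").decode("ascii")
print(escaped)         # Sat\xe9re Maw\xe9
print(repr(escaped))   # 'Sat\\xe9re Maw\\xe9'

# Undo the escaping to get the real unicode string (for processing) ...
recovered = escaped.encode("ascii").decode("unicode_escape")
# ... and encode it to get the UTF-8 bytes (for output).
utf8_bytes = recovered.encode("utf-8")
print(repr(utf8_bytes))  # b'Sat\xc3\xa9re Maw\xc3\xa9'
```

The repr() of the escaped string is where the doubled backslash shows up.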

> also, i still get an empty list when i run the code as suggested.

Strange. Have you passed re.DOTALL? (Without it, '.' does not match \n, so a 
pattern cannot match across line breaks.)
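
A small sketch of the effect, using a made-up sample string and pattern (not 
your actual data):

```python
import re

# Hypothetical sample: the text to capture spans two lines.
text = "<p>first line\nsecond line</p>"
pattern = r"<p>(.*?)</p>"

without_flag = re.findall(pattern, text)
with_flag = re.findall(pattern, text, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['first line\nsecond line']
```

Without the flag, '.' stops at the newline, the closing tag is never reached, 
and findall() returns an empty list.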


Denis
________________________________

la vita e estrany

http://spir.wikidot.com/
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor