On Tue, 2 Feb 2010 22:56:22 +0100
Norman Khine <[email protected]> wrote:
> i am no expert, but there seems to be a bigger difference.
>
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
>
> where as you get
>
> Sat\xc3\xa9re Maw\xc3\xa9
>
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9
This is a rather complicated issue that mixes python str, unicode strings, and
the repr() of each.
Kent is right that the *python string* "\xc3\xa9" is the UTF-8 encoding of 'é'
(2 bytes), whereas \xe9 is the *unicode code point* of 'é' (U+00E9), which
should only appear in a unicode string.
So:
unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
unicode("\xc3\xa9", "utf8") == u"\u00e9" -- decoding
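The same round trip as a small runnable sketch. It is written in Python 3 syntax, which is an assumption on my part since this thread uses Python 2: there, Python 2's unicode type is called str and Python 2's str is called bytes. The variable names are only illustrative.

```python
# Python 3 spelling of the examples above:
# Python 2 `unicode` -> Python 3 `str`, Python 2 `str` -> Python 3 `bytes`.
text = "\u00e9"                    # one code point, U+00E9 ('é')

encoded = text.encode("utf-8")     # text -> bytes (for output)
decoded = encoded.decode("utf-8")  # bytes -> text (for processing)

print(repr(encoded))               # the two UTF-8 bytes
print(len(text), len(encoded))     # one character, two bytes
print(decoded == text)             # the round trip is lossless
```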
The question is: what do you want to do with the result? You'll need either the
UTF-8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for
processing). But what you actually have is a mix of the two: the (python str)
repr of a unicode string, in which the backslash escapes are literal characters.
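To illustrate how such a mix can arise and be undone, here is a hedged sketch, again in Python 3 syntax. It assumes your text really contains a literal backslash followed by "xe9", which is only my guess from the repr() you posted. ascii() stands in for Python 2's repr() of a unicode string, and the "unicode_escape" codec turns the escapes back into characters.

```python
# Assumption: the data holds literal backslash escapes, as the posted repr() suggests.
name = "Sat\u00e9re Maw\u00e9"      # the real text, 'Satére Mawé'

# ascii() escapes non-ASCII characters much like Python 2's repr() of a
# unicode string did; strip the surrounding quotes to keep only the text.
escaped = ascii(name)[1:-1]         # contains backslash, 'x', 'e', '9' as characters

# Undo the escaping: bytes first, then decode the escape sequences.
recovered = escaped.encode("ascii").decode("unicode_escape")

# From the recovered text, the UTF-8 form for output is one encode() away.
utf8_bytes = recovered.encode("utf-8")

print(escaped)
print(recovered == name)
print(repr(utf8_bytes))
```

If your data does not contain literal backslashes, this step is not needed and a plain decode from the source encoding is enough.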
> also, i still get an empty list when i run the code as suggested.
? Strange. Have you checked the re.DOTALL flag? (Without it, '.' does not match
\n, so a pattern such as .* cannot match across line breaks.)
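A minimal sketch of the effect, using a made-up input string and pattern (not your actual code), which may explain the empty list:

```python
import re

# Hypothetical sample: the text to capture spans two lines.
html = "<p>first line\nsecond line</p>"
pattern = r"<p>(.*?)</p>"

# Without re.DOTALL, '.' refuses to cross the newline, so nothing matches.
without_flag = re.findall(pattern, html)

# With re.DOTALL, '.' also matches '\n' and the whole paragraph is captured.
with_flag = re.findall(pattern, html, re.DOTALL)

print(without_flag)
print(with_flag)
```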
Denis
________________________________
la vita e estrany
http://spir.wikidot.com/