Re: [Tutor] parse text file

spir Wed, 03 Feb 2010 00:20:07 -0800

On Tue, 2 Feb 2010 22:56:22 +0100
Norman Khine wrote:


> i am no expert, but there seems to be a bigger difference.
> 
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
> 
> where as you get
> 
> Sat\xc3\xa9re Maw\xc3\xa9
> 
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

This is a rather complicated issue that mixes Python str, unicode strings, and 
their repr().
Kent is right that the *Python str* "\xc3\xa9" is the UTF-8 encoded form of 
'é' (2 bytes), while \xe9 is the *Unicode code point* of 'é', which should 
only appear in a unicode string.
So:
   unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
   u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
   unicode("\xc3\xa9", "utf8") == u"\u00e9"     -- decoding
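
For comparison, here is a minimal sketch of the same round trip in Python 3, 
where str is always Unicode text and bytes holds the encoded form. The variable 
names are only illustrative, and the lines above remain the Python 2 spelling.

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
text = "\u00e9"                       # the single character 'é' (code point U+00E9)

encoded = text.encode("utf-8")        # encoding: text -> bytes
decoded = encoded.decode("utf-8")     # decoding: bytes -> text

print(len(text), len(encoded))        # one character, two UTF-8 bytes
print(encoded)
print(decoded == text)
```

The point is the same in both versions: one code point (e9) on the text side, 
two bytes (c3 a9) on the UTF-8 side.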

The question is: what do you want to do with the result? You need either the 
UTF-8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for 
processing). What you actually have is a mix of the two: the (Python str) repr 
of a unicode string, in which the escape \xe9 is spelled out as literal 
characters.
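
If the data really does contain the escape as literal text (backslash, 'x', 
'e', '9'), one possible way back to real text is the "unicode_escape" codec. 
This is only a sketch in Python 3 syntax, and it assumes the input looks exactly 
like the sample quoted above; I have not seen your actual file.

```python
# Assumed input: the escape sequence is literal text, not a real 'é'.
escaped = b"Sat\\xe9re Maw\\xe9"            # contains backslash, 'x', 'e', '9'

# Interpret the backslash escapes to recover the Unicode text (for processing).
text = escaped.decode("unicode_escape")

# Encode to UTF-8 when the result has to be written out (for output).
utf8 = text.encode("utf-8")

print(text)
print(utf8)
```

Here text is what you would process, and utf8 is the form with \xc3\xa9 that 
you would write to a file or terminal.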

> also, i still get an empty list when i run the code as suggested.

? Strange. Have you checked that re.DOTALL is set? (Otherwise '.' does not 
match \n, so a pattern built on '.' stops at the end of the line.)
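
A small illustration of the difference, with a made-up string and pattern (not 
the ones from your script):

```python
import re

# Hypothetical multi-line input and pattern, for illustration only.
data = "start\nmiddle\nend"
pattern = r"start(.*)end"

# Without re.DOTALL, '.' does not match '\n', so nothing spans the lines.
without_flag = re.findall(pattern, data)

# With re.DOTALL, '.' matches any character, including '\n'.
with_flag = re.findall(pattern, data, re.DOTALL)

print(without_flag)
print(with_flag)
```

An empty list from findall on multi-line input is the typical symptom of the 
missing flag.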


Denis
________________________________

la vita e estrany

http://spir.wikidot.com/