On Tue, 2 Feb 2010 22:56:22 +0100 Norman Khine <nor...@khine.net> wrote:
> i am no expert, but there seems to be a bigger difference.
>
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
>
> where as you get
>
> Sat\xc3\xa9re Maw\xc3\xa9
>
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

This is a rather complicated issue, mixing Python str, unicode strings, and their repr().
Kent is right that the *Python string* "\xc3\xa9" is the UTF-8 encoded form of 'é' (2 bytes), while \xe9 is the *Unicode code point* of 'é', which should only appear in a unicode string.
So:
    unicode.encode(u"\u00e9", "utf8") == "\xc3\xa9"
or more simply:
    u"\u00e9".encode("utf8") == "\xc3\xa9"
Conversely:
    unicode("\xc3\xa9", "utf8") == u"\u00e9"    -- decoding

The question is: what do you want to do with the result? You'll need either the UTF-8 form "\xc3\xa9" (for output) or the unicode string u"\u00e9" (for processing). But what you actually get is a kind of mix: the (Python str) repr of a unicode string, in which the escape \xe9 is stored as literal characters.

> also, i still get an empty list when i run the code as suggested.

? Strange. Have you checked re.DOTALL? (Without it, "." in a regex pattern does not match \n, so a match cannot span several lines.)

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/
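
P.S. A minimal sketch of the three forms discussed above, in case it helps to see them side by side. The variable names are only illustrative, and the sample value is taken from the repr() output quoted in this thread; I am assuming the data really holds the escape as literal characters (backslash, x, e, 9), which is what the doubled backslash in repr() suggests. The snippet avoids printing repr() so that it should behave the same on Python 2.6+ and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Sketch only: names below are illustrative, not taken from the original code.

# 1. The unicode string: a single code point, U+00E9.
text = u"\u00e9"

# 2. Its UTF-8 encoded form: two bytes, 0xC3 0xA9.
utf8_bytes = text.encode("utf8")
assert utf8_bytes == b"\xc3\xa9"
assert utf8_bytes.decode("utf8") == text   # decoding goes back to unicode

# 3. The "mix": the escape sequence stored as literal characters.
#    ASSUMPTION: this mirrors the value whose repr() showed \\xe9.
escaped = b"Sat\\xe9re Maw\\xe9"
assert len(b"\\xe9") == 4                  # backslash, 'x', 'e', '9'

# The unicode_escape codec interprets those literal escapes,
# giving a proper unicode string for processing...
recovered = escaped.decode("unicode_escape")
assert recovered == u"Sat\u00e9re Maw\u00e9"

# ...which can then be encoded to UTF-8 for output.
output = recovered.encode("utf8")
assert output == b"Sat\xc3\xa9re Maw\xc3\xa9"
```

Whether decoding with unicode_escape is the right repair depends on where the escaped text comes from; if the source can be made to hand over a real unicode string in the first place, that is the cleaner fix.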