On Thu, Oct 9, 2008 at 4:54 PM, Johannes Bauer <[EMAIL PROTECTED]> wrote:
> Hello group,
>
> Now when I take "website" directly from the parser, everything is fine.
> However I want to do some modifications before I parse it, namely UTF-8
> modifications in the style:
>
> website = website.replace(u"föö", u"bär")
That's not utf-8, that's unicode. Even if your file is saved as utf-8,
you're telling python to convert those from utf-8 encoded bytes to
unicode characters, by prefixing them with 'u'.

> Therefore, after fetching the web site content, I have to convert it to
> UTF-8 first, modify it and convert it back:

You have to convert it to unicode if and only if you are doing
manipulation with unicode strings.

> website = website.decode("latin1")
> website = website.replace(u"föö", u"bär")
> website = website.encode("latin1")
>
> This is incredibly ugly IMHO, as I would really like the parser to just
> accept UTF-8 input. However when I omit the re-encoding to latin1:

You could just use the precise Latin-1 byte strings you'd like to replace:

website = website.replace("f\xf6\xf6", "b\xe4r")

Or, you could set the encoding of your source file to Latin-1, by
putting the following on the first or second line of your source file:

# -*- coding: Latin-1 -*-

Then use the appropriate literals in your source code, making sure
that you save it as Latin-1 in your editor of choice.

Truthfully, though, I think your current approach really is the right
one. Decode to unicode character strings as soon as they come into
your program, manipulate them as unicode, then select your preferred
encoding when you write them back out. It's explicit, and only takes
two lines of code.

-- 
Jerry
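
For reference, here is a minimal sketch of the decode / manipulate / encode
approach described above, next to the byte-level alternative. The helper
name replace_in_latin1 and the sample page bytes are made up for
illustration, and it assumes the page really is Latin-1. The literals use
escapes so the example doesn't depend on how the source file is saved, and
it assumes a Python that accepts both b"" and u"" literals (2.6+ or 3.3+).

```python
def replace_in_latin1(raw, old, new, encoding="latin-1"):
    """Decode raw bytes, replace text as unicode, then encode again."""
    text = raw.decode(encoding)      # bytes -> unicode, at the input boundary
    text = text.replace(old, new)    # all manipulation happens on unicode
    return text.encode(encoding)     # unicode -> bytes, at the output boundary


# Hypothetical stand-in for the fetched web site content (Latin-1 bytes).
page = b"<p>f\xf6\xf6</p>"

# Unicode route: decode, replace with unicode literals, encode back.
result = replace_in_latin1(page, u"f\xf6\xf6", u"b\xe4r")

# Byte route: replace the exact Latin-1 byte sequences, no decoding at all.
same = page.replace(b"f\xf6\xf6", b"b\xe4r")
```

Both routes give the same bytes here. The byte-level version only works
because the encoding of the page is known in advance; the unicode route
keeps working if you later change the encoding argument.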