On Feb 2, 2012, at 2:24 PM, peter wrote:

> I am reading some text from a web site, using f=urllib.urlopen(....),
> and then s=f.read()
>
> I then extract a bit of 's' as s1, s1 contains "Na Ponta Do Pé"
>
> The é is encoded in a single byte as 0XE9.
>
> If I do IS_SLUG.urlify(s1) it throws and error because 0XE9 is not a
> valid character. I believe the encoding is ansii. I have tried all
> manner of encoding and decoding but cannot get anything to work. If I
> print s1 to the console or a file, then it works fine. But most python
> character operations fail, presumably because they are expecting utf-8
> which encodes é as two bytes.
>
> If I do
> s1="Na Ponta Do Pé"
> IS_SLUG.urlify(s1)
>
> There is no error.
>
> Clearly I could check for 0XE9 and convert it uniquely, but I wonder
> if anyone could suggest a conversion that would work for any ansii
> character. I have googled and experimented a lot on this with no
> success.

The page you're reading is encoded as Latin-1 (ISO-8859-1), where é is the single byte 0xE9. You need to decode it first.
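
Something along these lines should do it. This is only a sketch: the literal byte string stands in for whatever you slice out of f.read(), and I'm assuming the page really is Latin-1 and that urlify is happy with UTF-8 input (your second example suggests it is).

```python
# Stand-in for the bytes extracted from f.read(); 0xE9 is "é" in Latin-1.
s1 = b"Na Ponta Do P\xe9"

# Decode the Latin-1 bytes into a unicode string.
text = s1.decode("latin-1")

# Re-encode as UTF-8, where "é" becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")

# Assumed next step (not run here, needs web2py):
# slug = IS_SLUG.urlify(utf8_bytes)
```

Every byte value is a valid Latin-1 character, so the decode step won't raise for any single-byte input. If the site actually uses a different encoding (Windows-1252, say), a few characters would come out wrong, so check the page's Content-Type header or meta charset if you can.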