On Feb 2, 2012, at 2:24 PM, peter wrote:

> I am reading some text from a web site, using f=urllib.urlopen(....),
> and then s=f.read()
> 
> I then extract a bit of 's' as s1, s1 contains "Na Ponta Do Pé"
> 
> The é is encoded in a single byte as 0XE9.
> 
> If I do IS_SLUG.urlify(s1) it throws an error because 0XE9 is not a
> valid character. I believe the encoding is ANSI. I have tried all
> manner of encoding and decoding but cannot get anything to work. If I
> print s1 to the console or a file, then it works fine. But most python
> character operations fail, presumably because they are expecting utf-8
> which encodes é as two bytes.
> 
> 
> If I do
> s1="Na Ponta Do Pé"
> IS_SLUG.urlify(s1)
> 
> There is no error.
> 
> Clearly I could check for 0XE9 and convert it uniquely, but I wonder
> if anyone could suggest a conversion that would work for any ANSI
> character. I have googled and experimented a lot on this with no
> success.

The page you're reading is encoded as Latin-1. You need to decode it first.
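
A minimal sketch of that decode step, assuming the page really is Latin-1 (Windows-1252 also maps 0xE9 to é, so check the response's charset if other characters come out wrong). The byte string below is a hypothetical stand-in for the slice you took from f.read(). Re-encoding to UTF-8 before calling IS_SLUG.urlify is my guess at what it wants, based on your UTF-8 literal working.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for s1, the bytes sliced out of f.read().
# 0xE9 is the single-byte Latin-1 encoding of "é".
raw = b"Na Ponta Do P\xe9"

# Decode the Latin-1 bytes into a unicode string.
text = raw.decode("latin-1")

# Re-encode as UTF-8, where "é" becomes the two bytes 0xC3 0xA9.
# Assumption: this is the form to pass on to IS_SLUG.urlify.
utf8_bytes = text.encode("utf-8")
```

Every byte value is valid in Latin-1, so the decode never raises; the risk is silently wrong characters if the page actually uses another encoding.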
