[web2py] problem with character encoding

2012-02-02 Thread peter
I am reading some text from a web site, using f=urllib.urlopen(),
and then s=f.read()

I then extract a bit of 's' as s1, s1 contains Na Ponta Do Pé

The é is encoded in a single byte as 0XE9.

If I do IS_SLUG.urlify(s1) it throws an error because 0XE9 is not a
valid character. I believe the encoding is ANSI. I have tried all
manner of encoding and decoding but cannot get anything to work. If I
print s1 to the console or a file, then it works fine. But most Python
character operations fail, presumably because they are expecting UTF-8,
which encodes é as two bytes.


If I do
s1='Na Ponta Do Pé'
IS_SLUG.urlify(s1)

There is no error.

Clearly I could check for 0XE9 and convert it uniquely, but I wonder
if anyone could suggest a conversion that would work for any ANSI
character. I have googled and experimented a lot on this with no
success.

Thanks
Peter


Re: [web2py] problem with character encoding

2012-02-02 Thread Jonathan Lundell
On Feb 2, 2012, at 2:24 PM, peter wrote:

> I am reading some text from a web site, using f=urllib.urlopen(),
> and then s=f.read()
>
> I then extract a bit of 's' as s1, s1 contains Na Ponta Do Pé
>
> The é is encoded in a single byte as 0XE9.
>
> If I do IS_SLUG.urlify(s1) it throws an error because 0XE9 is not a
> valid character. I believe the encoding is ANSI. I have tried all
> manner of encoding and decoding but cannot get anything to work. If I
> print s1 to the console or a file, then it works fine. But most Python
> character operations fail, presumably because they are expecting UTF-8,
> which encodes é as two bytes.
>
>
> If I do
> s1='Na Ponta Do Pé'
> IS_SLUG.urlify(s1)
>
> There is no error.
>
> Clearly I could check for 0XE9 and convert it uniquely, but I wonder
> if anyone could suggest a conversion that would work for any ANSI
> character. I have googled and experimented a lot on this with no
> success.

The page you're reading is encoded as Latin-1. You need to decode it first.
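
A minimal sketch of that decode step, assuming the page really is Latin-1 (ISO-8859-1). The byte string below is only a stand-in for the s1 extracted from f.read(), and the names raw, text, utf8_bytes and ascii_text are made up for illustration. Decoding gives a unicode string; re-encoding it as UTF-8 should give the same kind of byte string that the literal in the original post produced, which is presumably why that case raised no error. The last step (stripping accents with unicodedata) is optional and not something IS_SLUG is known to require.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Stand-in for the bytes extracted from f.read(); 0xE9 is 'é' in Latin-1.
raw = b'Na Ponta Do P\xe9'

# Step 1: decode the Latin-1 bytes into a unicode string.
text = raw.decode('latin-1')

# Step 2: re-encode as UTF-8 if a byte string is needed.
# 'é' becomes the two bytes 0xC3 0xA9, so the result is one byte longer.
utf8_bytes = text.encode('utf-8')

# Optional: split accented characters into base letter + combining mark,
# then drop whatever is not ASCII (assumption: plain ASCII is wanted).
ascii_text = (unicodedata.normalize('NFKD', text)
              .encode('ascii', 'ignore')
              .decode('ascii'))

print(repr(utf8_bytes))
print(ascii_text)
```

Latin-1 maps every possible byte value to a character, so the decode itself never fails. The catch is that it will also silently accept bytes that were really in some other encoding, so it is worth checking the page's Content-Type header or meta charset rather than guessing.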