Re: Special chars with HTMLParser

2009-08-07 Thread Stefan Behnel
Fafounet wrote:
> I am parsing a web page with special chars such as &#xE9; (which
> stands for é).
> I know I can get the unicode character é from
> unicode('\xe9', 'iso-8859-1'),
> but with those extra characters I don't know.
>
> I tried to implement handle_charref within HTMLParser without success.
> Furthermore, if I have the data ab&#xE9;cd, handle_data will get ab,
> handle_charref will get xe9 and then handle_data doesn't have the end
> of the string (cd).

Any reason you can't use a tree based HTML parser like the one in
lxml.html? That would eliminate this kind of problem altogether, as you'd
always get a well-decoded unicode string from the tree content.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Special chars with HTMLParser

2009-08-05 Thread Fafounet
Hello,

I am parsing a web page with special chars such as &#xE9; (which
stands for é).
I know I can get the unicode character é from
unicode('\xe9', 'iso-8859-1'),
but with those extra characters I don't know.

I tried to implement handle_charref within HTMLParser without success.
Furthermore, if I have the data ab&#xE9;cd, handle_data will get ab,
handle_charref will get xe9 and then handle_data doesn't have the end
of the string (cd).

Thank you for your help,
Fabien


Re: Special chars with HTMLParser

2009-08-05 Thread Piet van Oostrum
Fafounet <fafou...@gmail.com> (F) wrote:

F> Hello,
F> I am parsing a web page with special chars such as &#xE9; (which
F> stands for é).
F> I know I can get the unicode character é from
F> unicode('\xe9', 'iso-8859-1'),
F> but with those extra characters I don't know.

F> I tried to implement handle_charref within HTMLParser without success.
F> Furthermore, if I have the data ab&#xE9;cd, handle_data will get ab,
F> handle_charref will get xe9 and then handle_data doesn't have the end
F> of the string (cd).

The character references indicate Unicode ordinals, not iso-8859-1
characters. In your example, decoding as iso-8859-1 gives the proper
character because iso-8859-1 coincides with the first 256 Unicode
ordinals, but for characters outside iso-8859-1 it will fail.
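
To make the difference concrete, here is a minimal sketch. It assumes Python 3 (where unichr is spelled chr), and charref_to_char is a hypothetical helper name, not part of any library. It shows that the two interpretations agree for é but that a character such as the euro sign (&#x20AC;) has no iso-8859-1 encoding at all.

```python
# Python 3 sketch: a numeric character reference names a Unicode code point.
# charref_to_char is a hypothetical helper, introduced only for illustration.
def charref_to_char(name):
    """Convert the body of a reference ('xE9' or '233') to a character."""
    if name[:1] in ('x', 'X'):
        return chr(int(name[1:], 16))   # hexadecimal form, e.g. &#xE9;
    return chr(int(name, 10))           # decimal form, e.g. &#233;

e_acute = charref_to_char('xE9')
euro = charref_to_char('x20AC')

# Below code point 256 the iso-8859-1 byte value and the Unicode ordinal agree.
latin1_agrees = bytes([0xE9]).decode('iso-8859-1') == e_acute

# The euro sign lies outside iso-8859-1, so encoding it raises an error.
try:
    euro.encode('iso-8859-1')
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```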

This should give you an idea:

from htmlentitydefs import name2codepoint
...
    def handle_charref(self, name):
        if name.startswith('x'):
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        print 'char:', repr(unichr(num))

    def handle_entityref(self, name):
        print 'char:', unichr(name2codepoint[name])

If your HTML may be invalid, you should add some exception handling.
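
One possible shape for that exception handling is sketched below. It assumes Python 3, where htmlentitydefs is html.entities and unichr is chr. The class name CharCollector and the choice to fall back to the raw reference text are illustrative assumptions, not something prescribed in this thread. Note that convert_charrefs=False is needed on Python 3 for the two handlers to be called at all.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class CharCollector(HTMLParser):
    """Hypothetical parser that records each decoded reference."""

    def __init__(self):
        # With convert_charrefs=True the parser decodes references itself
        # and never calls handle_charref or handle_entityref.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_charref(self, name):
        try:
            if name[:1] in ('x', 'X'):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            self.chars.append(chr(num))
        except (ValueError, OverflowError):
            # Not a valid code point: keep the reference as literal text.
            self.chars.append('&#' + name + ';')

    def handle_entityref(self, name):
        try:
            self.chars.append(chr(name2codepoint[name]))
        except KeyError:
            # Unknown entity name: keep the reference as literal text.
            self.chars.append('&' + name + ';')


parser = CharCollector()
parser.feed('ab&#xE9;cd &eacute; &nosuch; &#x110000; end')
parser.close()
```

Keeping the raw text is only one choice; substituting U+FFFD or raising would be equally reasonable depending on the application.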
-- 
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org


Re: Special chars with HTMLParser

2009-08-05 Thread Fafounet
Thank you, now I can get the correct character.

Now when I have the string ab&#xE9;cd I can get ab, then é thanks to
your function, and then cd. But how is it possible to know that cd is
still part of the same word?


Fabien





Re: Special chars with HTMLParser

2009-08-05 Thread Piet van Oostrum
Fafounet <fafou...@gmail.com> (F) wrote:

F> Thank you, now I can get the correct character.
F> Now when I have the string ab&#xE9;cd I can get ab, then é thanks to
F> your function, and then cd. But how is it possible to know that cd is
F> still part of the same word?

That depends on your definition of 'word', which is
language-dependent.

What you normally do is collect the text in a (unicode) string variable.
This happens in handle_data, handle_charref and handle_entityref.
Then you check that the previously collected stuff was a word (e.g.
consisting of Unicode letters), and that the new stuff also consists of
letters. If your language has additional word constituents like - or '
you have to add this.
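
The collecting step could look roughly like the sketch below. It assumes Python 3 (html.parser, chr), and the class name TextCollector and its text() method are hypothetical names chosen for illustration. All three handlers append to the same list, so ab, é and cd end up in one string.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that gathers all text into a single string."""

    def __init__(self):
        # convert_charrefs=False so the reference handlers are called.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        if name[:1] in ('x', 'X'):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name, 10)))

    def handle_entityref(self, name):
        self.parts.append(chr(name2codepoint[name]))

    def text(self):
        return ''.join(self.parts)


collector = TextCollector()
collector.feed('ab&#xE9;cd')
collector.close()
result = collector.text()
```

On Python 3.5 and later, leaving convert_charrefs at its default of True makes the parser decode references itself and pass the decoded text to handle_data, so the two extra handlers are not needed there.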

You can do this with unicodedata.category or with a regular
expression. If your locale is correct, \w in a regular expression may be
helpful.
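
Both approaches can be sketched as follows, assuming Python 3, where \w on str patterns is Unicode-aware by default and does not depend on the locale. The function names and the default set of extra constituents (- and ') are assumptions for illustration; adjust them to your language.

```python
import re
import unicodedata


def is_word_char(ch, extra="-'"):
    # Unicode general categories starting with 'L' are letters (Lu, Ll, ...).
    return unicodedata.category(ch).startswith('L') or ch in extra


def is_word(text, extra="-'"):
    """True if text is non-empty and made only of letters and extras."""
    return bool(text) and all(is_word_char(ch, extra) for ch in text)


# [^\W\d_] means: a \w character that is neither a decimal digit nor an
# underscore. Runs of these may be joined by a single - or '.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def is_word_re(text):
    return WORD_RE.fullmatch(text) is not None
```

Applied to the collected string, is_word('ab\xe9cd') holds while is_word('ab cd') does not, so cd is still part of the same word as long as the text gathered so far passes the check.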
-- 
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org