Re: Special chars with HTMLParser
Fafounet wrote:
> I am parsing a web page with special chars such as &#xE9; (which stands for é).
> I know I can have the unicode character é from unicode("\xe9", "iso-8859-1"),
> but with those extra characters I don't know.
> I tried to implement handle_charref within HTMLParser without success.
> Furthermore, if I have the data ab&#xE9;cd, handle_data will get ab,
> handle_charref will get xe9 and then handle_data doesn't have the end
> of the string (cd).

Any reason you can't use a tree-based HTML parser like the one in lxml.html? That would eliminate this kind of problem altogether, as you'd always get a well-decoded unicode string from the tree content.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
Special chars with HTMLParser
Hello,

I am parsing a web page with special chars such as &#xE9; (which stands for é). I know I can have the unicode character é from unicode("\xe9", "iso-8859-1"), but with those extra characters I don't know.

I tried to implement handle_charref within HTMLParser without success. Furthermore, if I have the data ab&#xE9;cd, handle_data will get ab, handle_charref will get xe9 and then handle_data doesn't have the end of the string (cd).

Thank you for your help,
Fabien
Re: Special chars with HTMLParser
Fafounet fafou...@gmail.com (F) wrote:
> Hello,
> I am parsing a web page with special chars such as &#xE9; (which
> stands for é).
> I know I can have the unicode character é from unicode
> ("\xe9", "iso-8859-1")
> but with those extra characters I don't know.
> I tried to implement handle_charref within HTMLParser without success.
> Furthermore, if I have the data ab&#xE9;cd, handle_data will get ab,
> handle_charref will get xe9 and then handle_data doesn't have the end
> of the string (cd).

The character references indicate Unicode ordinals, not iso-8859-1 characters. In your example it will give the proper character because iso-8859-1 coincides with the first part of the Unicode ordinals, but for characters outside of iso-8859-1 it will fail.

This should give you an idea:

    from htmlentitydefs import name2codepoint
    ...
        def handle_charref(self, name):
            if name.startswith('x'):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            print 'char:', repr(unichr(num))

        def handle_entityref(self, name):
            print 'char:', unichr(name2codepoint[name])

If your HTML may be illegal you should add some exception handling.
--
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
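For readers on Python 3, the same idea can be sketched as follows. This is only an illustration under stated assumptions, not Piet's code: the class name CharRefDecoder and its decoded list are hypothetical, htmlentitydefs is replaced by its Python 3 home html.entities, unichr by chr, and convert_charrefs=False is passed because the Python 3 parser otherwise decodes references itself and never calls these two handlers. The exception handling mentioned above is reduced to keeping the raw reference text when it cannot be decoded.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class CharRefDecoder(HTMLParser):
    """Hypothetical helper: records every decoded character reference."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)
        self.decoded = []

    def handle_charref(self, name):
        # name is 'xE9' for &#xE9; and '233' for &#233;
        try:
            if name.lower().startswith('x'):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            self.decoded.append(chr(num))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: keep the raw text.
            self.decoded.append('&#%s;' % name)

    def handle_entityref(self, name):
        try:
            self.decoded.append(chr(name2codepoint[name]))
        except KeyError:
            # Unknown entity name: keep the raw text.
            self.decoded.append('&%s;' % name)


parser = CharRefDecoder()
parser.feed('ab&#xE9;cd &#233; &eacute; &#x20AC;')
parser.close()
print(parser.decoded)  # ['é', 'é', 'é', '€']
```

The last reference, &#x20AC; (the euro sign), lies outside iso-8859-1, which is the case where decoding the number as an iso-8859-1 byte would fail.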
Re: Special chars with HTMLParser
Thank you, now I can get the correct character.

Now when I have the string ab&#xE9;cd I can get ab, then é thanks to your function, and then cd. But how is it possible to know that cd is still the same word?

Fabien
Re: Special chars with HTMLParser
Fafounet fafou...@gmail.com (F) wrote:
> Thank you, now I can get the correct character.
> Now when I have the string ab&#xE9;cd I can get ab then é thanks to
> your function and then cd. But how is it possible to know that cd is
> still the same word ?

That depends on your definition of `word'. And that is language-dependent.

What you normally do is collect the text in a (unicode) string variable. This happens in handle_data, handle_charref and handle_entityref. Then you check that the previously collected stuff was a word (e.g. consisting of Unicode letters), and that the new stuff also consists of letters. If your language has additional word constituents like - or ' you have to add this.

You can do this with unicodedata.category or with a regular expression. If your locale is correct, \w in a regular expression may be helpful.
--
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
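The collecting approach described above might look like this in Python 3. It is a minimal sketch under assumptions: TextCollector, its parts list and the is_letter helper are hypothetical names, error handling for invalid references is left out for brevity, and "word" is taken to mean a run of \w characters, which also admits digits and underscores. In Python 3, \w on str patterns is Unicode-aware without any locale setup.

```python
import re
import unicodedata
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text from all three handlers in order."""

    def __init__(self):
        # Keep the charref/entityref handlers active (off by default in Python 3).
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # 'xE9' is hexadecimal, '233' is decimal.
        if name[:1] in ('x', 'X'):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name, 10)))

    def handle_entityref(self, name):
        self.parts.append(chr(name2codepoint[name]))

    def text(self):
        return ''.join(self.parts)


def is_letter(ch):
    # Unicode general categories starting with 'L' are letters (Lu, Ll, Lo, ...).
    return unicodedata.category(ch).startswith('L')


collector = TextCollector()
collector.feed('<p>ab&#xE9;cd and na&iuml;ve</p>')
collector.close()

# Split into words only after the pieces have been joined back together.
words = re.findall(r'\w+', collector.text())
print(words)  # ['abécd', 'and', 'naïve']
```

Because the pieces are joined before any word test is applied, ab, é and cd end up in the same word. If - or ' should count as word constituents, the pattern would need to be widened accordingly, or is_letter extended to accept them.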