Re: Special chars with HTMLParser

Stefan Behnel Fri, 07 Aug 2009 00:01:29 -0700

Fafounet wrote:
> I am parsing a web page with special chars such as &#xE9; (which
> stands for é).
> I know I can get the unicode character é from
> unicode("\xe9", "iso-8859-1"),
> but I don't know how to handle those extra characters.
> 
> I tried to implement handle_charref within HTMLParser without success.
> Furthermore, if I have the data ab&#xE9;cd, handle_data will get "ab",
> handle_charref will get xe9 and then handle_data doesn't have the end
> of the string ("cd").
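
For reference, the splitting described above is how HTMLParser delivers mixed content: text before the reference, the reference itself, and the text after it arrive as separate callbacks, so the pieces have to be collected and joined. A minimal sketch of that, assuming Python 3 (html.parser and chr() rather than the Python 2 HTMLParser module and unichr()); the class name TextCollector and its parts list are made up for illustration:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text and decodes numeric references."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref() in play on Python 3;
        # with the default (True) references are decoded before handle_data().
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Called once per run of plain text, e.g. "ab" and later "cd".
        self.parts.append(data)

    def handle_charref(self, name):
        # name is "xE9" for &#xE9; and "233" for &#233;
        if name[0] in "xX":
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

    def text(self):
        return "".join(self.parts)


parser = TextCollector()
parser.feed("ab&#xE9;cd")
parser.close()
result = parser.text()
print(result)
```

Named references such as &eacute; go through handle_entityref() instead and would need the same treatment.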


Any reason you can't use a tree based HTML parser like the one in
lxml.html? That would eliminate this kind of problem altogether, as you'd
always get a well-decoded unicode string from the tree content.
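
A minimal sketch of that approach, assuming Python 3 and that the third-party lxml package is installed; the sample markup is made up to match the data in the question:

```python
import lxml.html  # third-party package, not part of the standard library

# Parse a fragment containing the character reference from the question.
doc = lxml.html.fromstring("<p>ab&#xE9;cd</p>")

# text_content() returns the element's text with references already decoded.
text = doc.text_content()
print(text)
```

The reference is resolved by the parser, so there is no handle_charref step and no need to stitch the text back together.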

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list
