Re: Special chars with HTMLParser

Fafounet Wed, 05 Aug 2009 06:06:33 -0700

Thank you, now I can get the correct character.

Now, given the string ab&#xE9;cd, I get ab, then é (thanks to your
function), and then cd. But how can I tell that cd is still part of
the same word?
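
One possible way to approach this, as a rough sketch rather than a
definitive answer: instead of printing each fragment as it arrives,
append every piece (plain data, character references, entity
references) to a buffer and split on whitespace only when a tag or the
end of the input is reached. The sketch uses the Python 3 names
(html.parser, html.entities, chr) rather than the Python 2 ones used in
this thread. The class name WordCollector, the helper _flush_words and
the rule that every tag ends a word are assumptions made for
illustration only.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class WordCollector(HTMLParser):
    """Hypothetical helper: joins text fragments before splitting into words."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active
        # (Python 3 would otherwise fold references into handle_data itself).
        super().__init__(convert_charrefs=False)
        self.buffer = []   # fragments of the text run currently being read
        self.words = []    # completed words

    def handle_data(self, data):
        self.buffer.append(data)

    def handle_charref(self, name):
        # Numeric reference: hexadecimal if prefixed with x/X, else decimal.
        if name.startswith(('x', 'X')):
            self.buffer.append(chr(int(name[1:], 16)))
        else:
            self.buffer.append(chr(int(name, 10)))

    def handle_entityref(self, name):
        self.buffer.append(chr(name2codepoint[name]))

    def _flush_words(self):
        # Join the fragments first, then split, so that 'ab', 'é' and 'cd'
        # end up in a single word.
        self.words.extend(''.join(self.buffer).split())
        self.buffer = []

    # Assumption: any tag ends the current run of text.
    def handle_starttag(self, tag, attrs):
        self._flush_words()

    def handle_endtag(self, tag):
        self._flush_words()

    def close(self):
        super().close()
        self._flush_words()


parser = WordCollector()
parser.feed('<p>ab&#xE9;cd ef</p>')
parser.close()
print(parser.words)
```

Because nothing is split until the buffer is flushed, cd is only a
separate word if whitespace or a tag actually separates it from what
came before. If inline tags such as <b> should not break a word, the
two tag handlers would need to be more selective.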



Fabien


> The character references indicate Unicode ordinals, not iso-8859-1
> characters. In your example it will give the proper character because
> iso-8859-1 coincides with the first part of the Unicode ordinals, but
> for characters outside of iso-8859-1 it will fail.
>
> This should give you an idea:
>
> from htmlentitydefs import name2codepoint
> ...
>     def handle_charref(self, name):
>         if name.startswith('x'):
>             num = int(name[1:], 16)
>         else:
>             num = int(name, 10)
>         print 'char:', repr(unichr(num))
>
>     def handle_entityref(self, name):
>         print 'char:', unichr(name2codepoint[name])
>
> If your HTML may be illegal you should add some exception handling.
> --
> Piet van Oostrum <[email protected]>
> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
> Private email: [email protected]
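
The exception handling mentioned in the quoted reply might look
something like the sketch below. It is written with the Python 3 names
(html.entities, chr) rather than the Python 2 ones quoted above. The
function names decode_charref and decode_entityref, and the choice to
fall back to the original markup, are assumptions made for
illustration and not part of the quoted code.

```python
from html.entities import name2codepoint


def decode_charref(name):
    """Return the character for a numeric reference, or the raw markup."""
    try:
        if name[:1] in ('x', 'X'):
            return chr(int(name[1:], 16))
        return chr(int(name, 10))
    except (ValueError, OverflowError):
        # Not a number, or outside the Unicode range: keep the text as written.
        return '&#%s;' % name


def decode_entityref(name):
    """Return the character for a named entity, or the raw markup."""
    try:
        return chr(name2codepoint[name])
    except KeyError:
        # Unknown entity name: keep the text as written.
        return '&%s;' % name


print(repr(decode_charref('xE9')))
print(repr(decode_entityref('bogus')))
```

The handlers can then call these helpers instead of converting
directly, so a malformed reference does not abort the whole parse.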

-- 
http://mail.python.org/mailman/listinfo/python-list
