Thank you, now I can get the correct character. With the string abécd I now get ab, then é thanks to your function, and then cd. But how can I tell that cd still belongs to the same word?
Fabien

> The character references indicate Unicode ordinals, not iso-8859-1
> characters. In your example it will give the proper character because
> iso-8859-1 coincides with the first part of the Unicode ordinals, but
> for characters outside of iso-8859-1 it will fail.
>
> This should give you an idea:
>
> from htmlentitydefs import name2codepoint
> ...
>     def handle_charref(self, name):
>         if name.startswith('x'):
>             num = int(name[1:], 16)
>         else:
>             num = int(name, 10)
>         print 'char:', repr(unichr(num))
>
>     def handle_entityref(self, name):
>         print 'char:', unichr(name2codepoint[name])
>
> If your HTML may be illegal you should add some exception handling.
> --
> Piet van Oostrum <p...@cs.uu.nl>
> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
> Private email: p...@vanoostrum.org
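
On the question of whether cd still belongs to the same word: the text and the references arrive through separate callbacks, so one option is to append every piece to a single buffer and split into words only once parsing is done. Here is a minimal sketch of that idea. It assumes Python 3 (html.parser, html.entities and chr) rather than the Python 2 modules quoted above, and TextCollector and extract_text are hypothetical names chosen for illustration, not part of any library.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text and decoded references in one buffer."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active,
        # which mirrors the callbacks used in the quoted Python 2 code.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        # Plain text between references, e.g. 'ab' and 'cd'.
        self.pieces.append(data)

    def handle_charref(self, name):
        # name is the part between '&#' and ';', e.g. '233' or 'xe9'.
        if name.startswith(('x', 'X')):
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        self.pieces.append(chr(num))

    def handle_entityref(self, name):
        # Assumption: unknown entity names are kept verbatim instead of
        # raising KeyError, as a simple form of exception handling.
        if name in name2codepoint:
            self.pieces.append(chr(name2codepoint[name]))
        else:
            self.pieces.append('&%s;' % name)

    def text(self):
        # Joining the pieces puts 'ab', 'é' and 'cd' back into one string.
        return ''.join(self.pieces)


def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text('ab&eacute;cd ef').split() then gives the words with the accented character in place, because the word boundaries are decided on the joined text and not on the individual callbacks.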