Re: unescape HTML entities

Klaus Alexander Seistrup Sat, 28 Oct 2006 14:48:30 -0700

Rares Vernica wrote:

> How can I unescape HTML entities like "&nbsp;"?
>
> I know about xml.sax.saxutils.unescape() but it only deals with
> "&amp;", "&lt;", and "&gt;".
>
> Also, I know about htmlentitydefs.entitydefs, but not only is this
> dictionary the opposite of what I need, it also does not have
> "&nbsp;".


How about something like:

#v+
#!/usr/bin/env python
'''dehtml.py'''

import re
import htmlentitydefs

# Match any named entity listed in name2codepoint, e.g. "&nbsp;"
myrx = re.compile('&(' + '|'.join(htmlentitydefs.name2codepoint.keys()) + ');')

def dehtml(s):
    return re.sub(
        myrx,
        lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]),
        s
    )
# end def dehtml

if __name__ == '__main__':
    import sys
    print dehtml(sys.stdin.read()).encode('utf-8')
# end if

#v-

E.g.:

#v+

$ echo 'fr&aelig;kke fr&oslash;l&aring;r' | ./dehtml.py
frække frølår
$ 

#v-
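
For readers on Python 3, here is a minimal sketch of the same approach,
assuming Python 3.4 or later. There, htmlentitydefs is available as
html.entities, unichr() is spelled chr(), and the standard library also
provides html.unescape(), which handles numeric references such as
"&#229;" as well. The name ENTITY_RX and the sample strings are only
illustrative.

```python
import re
import html
from html.entities import name2codepoint

# Same idea as dehtml.py: match only the entity names known to the stdlib.
ENTITY_RX = re.compile('&(' + '|'.join(name2codepoint) + ');')


def dehtml(s):
    """Replace named entities such as &nbsp; with the characters they denote."""
    return ENTITY_RX.sub(lambda m: chr(name2codepoint[m.group(1)]), s)


if __name__ == '__main__':
    text = 'fr&aelig;kke fr&oslash;l&aring;r'
    print(dehtml(text))          # regex-based version, named entities only
    print(html.unescape(text))   # stdlib helper, also decodes numeric references
```

The regex version leaves numeric references untouched, so html.unescape()
is probably the simpler choice when it is available.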

-- 
Klaus Alexander Seistrup
Copenhagen, Denmark, EU
http://klaus.seistrup.dk/