Re: Convert from unicode chars to HTML entities

Gabriel Genellina Sun, 28 Jan 2007 20:40:11 -0800

On Mon, 29 Jan 2007 00:05:24 -0300, Steven D'Aprano
<[EMAIL PROTECTED]> wrote:


> I have a string containing Latin-1 characters:
>
> s = u"© and many more..."
>
> I want to convert it to HTML entities:
>
> result =>
> "&copy; and many more..."
>

Module htmlentitydefs contains the tables you're looking for, but you need  
a few transforms:

<code>
# -*- coding: iso-8859-15 -*-
from htmlentitydefs import codepoint2name

unichr2entity = dict((unichr(code), u'&%s;' % name)
     for code,name in codepoint2name.iteritems()
     if code!=38) # exclude "&"

def htmlescape(text, d=unichr2entity):
     if u"&" in text:
         text = text.replace(u"&", u"&amp;")
     for key, value in d.iteritems():
         if key in text:
             text = text.replace(key, value)
     return text

print '%r' % htmlescape(u'hello')
print '%r' % htmlescape(u'"©® áé&ö <²³>')
</code>

Output:
u'hello'
u'&quot;&copy;&reg; &aacute;&eacute;&amp;&ouml; &lt;&sup2;&sup3;&gt;'

The result is a unicode object with every known entity replaced. Characters
that have no named entity in the table are left untouched; as the docs for
htmlentitydefs say, "the definition provided here contains all the entities
defined by XHTML 1.0 that can be handled using simple textual substitution
in the Latin-1 character set (ISO-8859-1)."
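
If you also need to cover characters that have no named entity, one option
is to fall back to numeric character references. The following is only a
sketch, not part of the code above: it assumes Python 3, where
htmlentitydefs was renamed to html.entities, and the function name
htmlescape_all is made up for illustration. It walks the text once, so the
"&" of an already produced entity is never escaped a second time.

```python
# Sketch for Python 3 (assumption): htmlentitydefs is html.entities there.
from html.entities import codepoint2name

def htmlescape_all(text):
    """Use a named entity where one exists, else a numeric reference."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # Known entity, e.g. 169 -> "&copy;", 38 -> "&amp;"
            out.append('&%s;' % codepoint2name[code])
        elif code > 127:
            # Non-ASCII character without a name: decimal reference
            out.append('&#%d;' % code)
        else:
            # Plain ASCII passes through unchanged
            out.append(ch)
    return ''.join(out)

print(htmlescape_all('\u00a9 & \u0142'))   # prints: &copy; &amp; &#322;
```

Here U+0142 has no entry in the table, so it comes out as a decimal
reference instead of being passed through as-is.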

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list
