Matthew Woodcraft <[EMAIL PROTECTED]> wrote:

> Gabriel Genellina wrote:
>> Steven D'Aprano wrote:
>
>>> I have searched for, but been unable to find, standard library
>>> functions that escapes or unescapes URLs. Are there any such
>>> functions?
>
>> Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
>
> I don't see a cgi.unescape in the standard library.
>
> I don't think xml.sax.saxutils.unescape will be suitable for Steven's
> purpose, because it doesn't process numeric character references (which
> are both legal and seen in the wild in /href/ attributes).
>
Here's the code I use. It handles decimal and hex character references
as well as all HTML named entities.

    import re
    from htmlentitydefs import name2codepoint

    # Work on a copy so the module's table isn't modified, and add
    # &apos;, which is missing from the HTML 4 entity list.
    name2codepoint = name2codepoint.copy()
    name2codepoint['apos'] = ord("'")

    # group 1: decimal reference, group 2: hex reference, group 3: name.
    # Names may contain digits (e.g. &frac12;, &sup2;).
    EntityPattern = re.compile(
        r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z][a-zA-Z\d]*));')

    def decodeEntities(s, encoding='utf-8'):
        def unescape(match):
            code = match.group(1)
            if code:
                return unichr(int(code, 10))
            else:
                code = match.group(2)
                if code:
                    return unichr(int(code, 16))
                else:
                    code = match.group(3)
                    if code in name2codepoint:
                        return unichr(name2codepoint[code])
            # Unknown entity name: leave the text untouched.
            return match.group(0)

        # Byte strings are decoded first so the result is always unicode.
        if isinstance(s, str):
            s = s.decode(encoding)
        return EntityPattern.sub(unescape, s)

-- 
Duncan Booth http://kupuguy.blogspot.com
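
P.S. For anyone reading this on Python 3, here is a rough sketch of the same approach. It is not part of the code above, just a port under a few assumptions: htmlentitydefs became html.entities, unichr became chr, and byte strings are now the bytes type. The names decode_entities, ENTITIES and ENTITY_PATTERN are made up for this example, and the fallback for out-of-range numeric references is an addition of the sketch.

```python
import re
from html.entities import name2codepoint

# Copy the table so the module-level dict is not mutated, and add &apos;
# as the original post does.
ENTITIES = dict(name2codepoint)
ENTITIES['apos'] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: entity name
ENTITY_PATTERN = re.compile(
    r'&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')


def decode_entities(s, encoding='utf-8'):
    """Replace character references and named entities in s with text."""
    def unescape(match):
        dec, hexa, name = match.groups()
        try:
            if dec:
                return chr(int(dec, 10))
            if hexa:
                return chr(int(hexa, 16))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: leave the text as-is.
            return match.group(0)
        if name in ENTITIES:
            return chr(ENTITIES[name])
        # Unknown entity name: leave the text as-is.
        return match.group(0)

    # Decode byte strings first so the result is always str.
    if isinstance(s, bytes):
        s = s.decode(encoding)
    return ENTITY_PATTERN.sub(unescape, s)


print(decode_entities('Tom &amp; Jerry &#65;&#x42; &apos;quoted&apos;'))
```

Python 3.4 and later also ship html.unescape in the standard library, which covers similar ground and is probably the better choice unless you need this exact behaviour (for example, leaving unknown names untouched).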