Re: html escape sequences
Leif K-Brooks wrote:
> Will McGugan wrote:
> > I'd like to replace html escape sequences, like and ' with single
> > characters. Is there a dictionary defined somewhere I can use to
> > replace these sequences?
>
> How about this?
>
>     import re
>     from htmlentitydefs import name2codepoint
>
>     _entity_re = re.compile(r'&(?:(#)(\d+)|([^;]+));')
>
>     def _repl_func(match):
>         if match.group(1): # Numeric character reference
>             return unichr(int(match.group(2)))
>         else:
>             return unichr(name2codepoint[match.group(3)])
>
>     def handle_html_entities(string):
>         return _entity_re.sub(_repl_func, string)

muchas gracias!

Will McGugan
--
http://mail.python.org/mailman/listinfo/python-list
Re: html escape sequences
Will McGugan wrote:
> I'd like to replace html escape sequences, like and ' with single
> characters. Is there a dictionary defined somewhere I can use to
> replace these sequences?

How about this?

    import re
    from htmlentitydefs import name2codepoint

    _entity_re = re.compile(r'&(?:(#)(\d+)|([^;]+));')

    def _repl_func(match):
        if match.group(1): # Numeric character reference
            return unichr(int(match.group(2)))
        else:
            return unichr(name2codepoint[match.group(3)])

    def handle_html_entities(string):
        return _entity_re.sub(_repl_func, string)
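For readers on Python 3, the approach in the message above might be sketched as follows. This is an adaptation under stated assumptions, not the poster's code: `htmlentitydefs` became `html.entities`, `unichr` became `chr`, and the hexadecimal branch, the fallback for unknown entities, and the name `replace_entities` are additions made here for illustration.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# Matches a decimal reference, a hexadecimal reference, or a named entity.
_ENTITY_RE = re.compile(r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));")


def _replace(match):
    decimal, hexadecimal, name = match.groups()
    try:
        if decimal:
            return chr(int(decimal))
        if hexadecimal:
            return chr(int(hexadecimal, 16))
        return chr(name2codepoint[name])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: keep the original text.
        # (This fallback is an assumption; the original code would raise.)
        return match.group(0)


def replace_entities(text):
    """Hypothetical helper: replace character references in text."""
    return _ENTITY_RE.sub(_replace, text)


print(replace_entities("&lt;b&gt; caf&eacute; &#39;ok&#x27; &bogus;"))
```

Leaving unrecognised sequences untouched is one possible design choice; raising an error, as the original does for unknown names, may be preferable when the input is expected to be well-formed.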
html escape sequences
Hi,

I'd like to replace html escape sequences, like and ' with single characters. Is there a dictionary defined somewhere I can use to replace these sequences?

Thanks,

Will McGugan
Re: converting html escape sequences to unicode characters
On Fri, 2004-12-10 at 16:09, Craig Ringer wrote:
> On Fri, 2004-12-10 at 08:36, harrelson wrote:
> > I have a list of about 2500 html escape sequences (decimal) that I need
> > to convert to utf-8. Stuff like:
>
> I'm pretty sure this somewhat horrifying code does it, but is probably
> an example of what not to do:

It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot.

If you ever need to know the hard way to build a unicode character...

--
Craig Ringer
Re: converting html escape sequences to unicode characters
On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

    >>> escapeseq = '&#48708;'
    >>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
    >>> uescape
    u'\ube44'
    >>> print uescape
    비

(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.

    >>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
    ...             '&#48372;', '&#45244;', '&#44144;', '&#50640;',
    ...             '&#50836;', '&#45236;', '&#47732;', '&#44552;',
    ...             '&#51060;', '&#50620;', '&#47560;', '&#51648;',
    ...             '&#51104;']
    >>> def unescape(escapeseq):
    ...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
    ...
    >>> print ' '.join([ unescape(x) for x in entities ])
    비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠

--
Craig Ringer
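The escape-string round trip in the message above is not needed to reach the same result. A minimal Python 3 sketch, assuming each input is a single well-formed decimal reference of the form `&#NNNNN;` (the function name `unescape_decimal` is made up for this example):

```python
def unescape_decimal(escapeseq):
    """Turn one decimal reference such as '&#48708;' into a character."""
    # Drop the leading '&#' and the trailing ';', then map the
    # code point number straight to a one-character string.
    return chr(int(escapeseq[2:-1]))


entities = ["&#48708;", "&#54665;", "&#44592;"]
print(" ".join(unescape_decimal(x) for x in entities))
```

The slicing does no validation, so malformed input would raise `ValueError`; a pattern match would be safer for untrusted text.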
Re: converting html escape sequences to unicode characters
harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8. Stuff like:
>
> &#48708; &#54665; &#44592; &#47196; &#48372; &#45244; &#44144; &#50640;
> &#50836; &#45236; &#47732; &#44552; &#51060; &#50620; &#47560; &#51648;
> &#51104;
>
> Anyone know what the decimal is representing? It doesn't seem to
> equate to a unicode codepoint...

In well-formed HTML (!) these should be the decimal values of Unicode
characters. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

    import unicodedata

    nums = [
        48708, 54665, 44592, 47196, 48372, 45244,
        44144, 50640, 50836, 45236, 47732, 44552,
        51060, 50620, 47560, 51648, 51104,
    ]

    for num in nums:
        print num, unicodedata.name(unichr(num), 'Unknown')

=>
    48708 HANGUL SYLLABLE BI
    54665 HANGUL SYLLABLE HAENG
    44592 HANGUL SYLLABLE GI
    47196 HANGUL SYLLABLE RO
    48372 HANGUL SYLLABLE BO
    45244 HANGUL SYLLABLE NAEL
    44144 HANGUL SYLLABLE GEO
    50640 HANGUL SYLLABLE E
    50836 HANGUL SYLLABLE YO
    45236 HANGUL SYLLABLE NAE
    47732 HANGUL SYLLABLE MYEON
    44552 HANGUL SYLLABLE GEUM
    51060 HANGUL SYLLABLE I
    50620 HANGUL SYLLABLE EOL
    47560 HANGUL SYLLABLE MA
    51648 HANGUL SYLLABLE JI
    51104 HANGUL SYLLABLE JAM

Kent
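The lookup in the message above carries over to Python 3 with `chr` in place of `unichr` and `print` as a function. The sketch below also shows the hexadecimal code point, which may make it easier to compare against the Unicode charts; the helper name `describe` and the output format are illustrative choices, not part of the original post.

```python
import unicodedata


def describe(codepoint):
    """Return the decimal value, U+XXXX form, and Unicode character name."""
    char = chr(codepoint)
    # The second argument is the fallback for unnamed code points.
    name = unicodedata.name(char, "Unknown")
    return f"{codepoint} U+{codepoint:04X} {name}"


for num in (48708, 54665, 44592):
    print(describe(num))
```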
converting html escape sequences to unicode characters
I have a list of about 2500 html escape sequences (decimal) that I need to convert to utf-8. Stuff like:

&#48708; &#54665; &#44592; &#47196; &#48372; &#45244; &#44144; &#50640; &#50836; &#45236; &#47732; &#44552; &#51060; &#50620; &#47560; &#51648; &#51104;

Anyone know what the decimal is representing? It doesn't seem to equate to a unicode codepoint...

culley
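For the conversion to UTF-8 asked about here, a minimal Python 3 sketch could decode the references first and encode the resulting text as a second step. It relies on `html.unescape` from the standard library (available since Python 3.4); the function name `references_to_utf8` is hypothetical.

```python
import html


def references_to_utf8(text):
    """Decode character references in text, then encode it as UTF-8 bytes."""
    decoded = html.unescape(text)  # '&#48708;' becomes the character U+BE44
    return decoded.encode("utf-8")


data = references_to_utf8("&#48708; &#54665; &#44592;")
print(data.decode("utf-8"))
```

Keeping the two steps separate reflects that the decimal values are Unicode code points, while UTF-8 is only one way of serialising them to bytes.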