Re: html escape sequences

2005-03-18 Thread Will McGugan
Leif K-Brooks wrote:
> Will McGugan wrote:
> > I'd like to replace html escape sequences, like &nbsp; and &#39; with
> > single characters. Is there a dictionary defined somewhere I can use
> > to replace these sequences?
>
> How about this?
>
> import re
> from htmlentitydefs import name2codepoint
>
> _entity_re = re.compile(r'&(?:(#)(\d+)|([^;]+));')
>
> def _repl_func(match):
>     if match.group(1): # Numeric character reference
>         return unichr(int(match.group(2)))
>     else:
>         return unichr(name2codepoint[match.group(3)])
>
> def handle_html_entities(string):
>     return _entity_re.sub(_repl_func, string)

muchas gracias!
Will McGugan
--
http://mail.python.org/mailman/listinfo/python-list


Re: html escape sequences

2005-03-18 Thread Leif K-Brooks
Will McGugan wrote:
> I'd like to replace html escape sequences, like &nbsp; and &#39; with
> single characters. Is there a dictionary defined somewhere I can use to
> replace these sequences?

How about this?
import re
from htmlentitydefs import name2codepoint

_entity_re = re.compile(r'&(?:(#)(\d+)|([^;]+));')

def _repl_func(match):
    if match.group(1): # Numeric character reference
        return unichr(int(match.group(2)))
    else:
        return unichr(name2codepoint[match.group(3)])

def handle_html_entities(string):
    return _entity_re.sub(_repl_func, string)
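
For readers on Python 3, where htmlentitydefs became html.entities and unichr became chr, the same idea might be sketched as below. The function names are kept from the post; the hexadecimal branch and the choice to leave unknown names untouched are assumptions added for illustration, not part of the original code. The standard library's html.unescape (Python 3.4 and later) covers the same ground and is shown for comparison.

```python
import re
from html import unescape
from html.entities import name2codepoint

# Decimal (&#39;), hexadecimal (&#x27;) or named (&nbsp;) references.
_entity_re = re.compile(r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));')

def _repl_func(match):
    dec, hexa, name = match.groups()
    if dec:
        return chr(int(dec))
    if hexa:
        return chr(int(hexa, 16))
    if name in name2codepoint:
        return chr(name2codepoint[name])
    # Assumption: an unknown name is left as-is rather than raising KeyError.
    return match.group(0)

def handle_html_entities(string):
    return _entity_re.sub(_repl_func, string)

sample = 'caf&eacute;&nbsp;&#39;ok&#x27; &bogus;'
print(repr(handle_html_entities(sample)))
print(repr(unescape('&nbsp;&#39;')))
```

The regex version is mainly useful when the replacement policy needs to differ from what html.unescape does.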


html escape sequences

2005-03-18 Thread Will McGugan
Hi,
I'd like to replace html escape sequences, like &nbsp; and &#39; with
single characters. Is there a dictionary defined somewhere I can use to
replace these sequences?

Thanks,
Will McGugan


Re: converting html escape sequences to unicode characters

2004-12-10 Thread Craig Ringer
On Fri, 2004-12-10 at 16:09, Craig Ringer wrote:
> On Fri, 2004-12-10 at 08:36, harrelson wrote:
> > I have a list of about 2500 html escape sequences (decimal) that I need
> > to convert to utf-8.  Stuff like:
> 
> I'm pretty sure this somewhat horrifying code does it, but is probably
> an example of what not to do:

It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot. If you ever need to know the hard way to
build a unicode character...

--
Craig Ringer



Re: converting html escape sequences to unicode characters

2004-12-10 Thread Craig Ringer
On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.

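>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
... '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")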
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠
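
The escape-string round trip is not needed, since the number inside a decimal reference is already the code point. A minimal Python 3 sketch of the direct route follows (chr replaces Python 2's unichr); the helper name unescape_decimal is hypothetical. Note also that "\\u%x" only produces a valid escape when the value has exactly four hex digits, so "%04x" would be needed for code points below U+1000.

```python
def unescape_decimal(escapeseq):
    """Turn a decimal reference such as '&#48708;' into one character."""
    if not (escapeseq.startswith('&#') and escapeseq.endswith(';')):
        raise ValueError('not a decimal character reference: %r' % escapeseq)
    return chr(int(escapeseq[2:-1]))

entities = ['&#48708;', '&#54665;', '&#44592;']
text = ' '.join(unescape_decimal(x) for x in entities)
print(text)
print(text.encode('utf-8'))  # the UTF-8 bytes the original question asked for
```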

--
Craig Ringer



Re: converting html escape sequences to unicode characters

2004-12-09 Thread Kent Johnson
harrelson wrote:
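> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:
>
> &#48708;
> &#54665;
> &#44592;
> &#47196;
> &#48372;
> &#45244;
> &#44144;
> &#50640;
> &#50836;
> &#45236;
> &#47732;
> &#44552;
> &#51060;
> &#50620;
> &#47560;
> &#51648;
> &#51104;
>
> Anyone know what the decimal is representing?  It doesn't seem to
> equate to a unicode codepoint...
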
In well-formed HTML (!) these should be the decimal values of Unicode 
characters. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1
These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
    48708,
    54665,
    44592,
    47196,
    48372,
    45244,
    44144,
    50640,
    50836,
    45236,
    47732,
    44552,
    51060,
    50620,
    47560,
    51648,
    51104,
]

for num in nums:
    print num, unicodedata.name(unichr(num), 'Unknown')

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM
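
Since the original question was about getting UTF-8, the lookup above might be extended as in this Python 3 sketch (chr and print() replace unichr and the print statement). Only the first three values from the list are used, and the variable name rows is illustrative.

```python
import unicodedata

nums = [48708, 54665, 44592]

rows = []
for num in nums:
    char = chr(num)                           # decimal value -> character
    name = unicodedata.name(char, 'Unknown')  # same lookup as in the post
    utf8 = char.encode('utf-8')               # Hangul syllables take three bytes
    rows.append((num, name, utf8))
    print(num, 'U+%04X' % num, name, utf8)
```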
Kent


converting html escape sequences to unicode characters

2004-12-09 Thread harrelson
I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8.  Stuff like:

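&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;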

Anyone know what the decimal is representing?  It doesn't seem to
equate to a unicode codepoint...

culley
