On Nov 14, 12:30 pm, "Mark Tolonen" <[EMAIL PROTECTED]> wrote:
> "Mark Tolonen" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>
> > "Shiao" <[EMAIL PROTECTED]> wrote in message
> > news:[EMAIL PROTECTED]
> >> Hello,
> >> I'm trying to build a regex in python to identify punctuation
> >> characters in all the languages. Some regex implementations support an
> >> extended syntax \p{P} that does just that. As far as I know, python re
> >> doesn't. Any idea of a possible alternative?
> >>
> >> Apart from manually including the punctuation character range for each
> >> and every language, I don't see how this can be done.
> >>
> >> Thanks in advance for any suggestions.
> >>
> >> John
> >
> > You can always build your own pattern. Something like (Python 3.0rc2):
> >
> > >>> import unicodedata
> > >>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
> > >>> import re
> > >>> r=re.compile('['+Po+']')
> > >>> x='我是美國人。'
> > >>> x
> > '我是美國人。'
> > >>> r.findall(x)
> > ['。']
> >
> > -Mark
>
> This was an interesting problem. Need to escape \ and ] to find all the
> punctuation correctly, and it turns out those characters are sequential in
> the Unicode character set, so ] was coincidentally escaped in my first
> attempt.
>
> IDLE 3.0rc2
> >>> import re
> >>> import unicodedata as u
> >>> A=''.join(chr(i) for i in range(65536))
> >>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
> >>> len(A)
> 65536
> >>> len(P)
> 491
> >>> len(re.findall('['+P+']',A)) # ] was naturally escaped
> 490
> >>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
> {'\\'}
> >>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
> >>> len(re.findall('['+P+']',A))
> 491
>
> -Mark

Mark,

Many thanks. I feel almost ashamed I got away with it so easily :-)
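
For the archives, here is a minimal sketch of the same idea wrapped in a function, assuming Python 3. It lets re.escape() handle the backslash and ] instead of the two manual replace() calls, and by default it scans every code point rather than only the first 65536. The names punctuation_chars and punctuation_pattern are mine, made up for illustration, and are not part of any library.

```python
import re
import sys
import unicodedata


def punctuation_chars(limit=sys.maxunicode + 1):
    """Return every character below `limit` whose Unicode general
    category starts with 'P' (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return [
        chr(i)
        for i in range(limit)
        if unicodedata.category(chr(i)).startswith('P')
    ]


def punctuation_pattern(limit=sys.maxunicode + 1):
    """Compile a character class matching any single punctuation character.

    Each character is passed through re.escape(), so characters that are
    special inside [...] (backslash, ], ^, -) are matched literally.
    """
    body = ''.join(re.escape(c) for c in punctuation_chars(limit))
    return re.compile('[' + body + ']')


if __name__ == '__main__':
    r = punctuation_pattern()
    print(r.findall('我是美國人。'))
    print(r.findall('a\\b]c'))
```

The exact number of characters found depends on the Unicode database shipped with the interpreter, so it will not necessarily be the 491 from the session quoted above. Passing limit=65536 restricts the scan to the same range Mark used.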