Re: Identifying unicode punctuation characters with Python regex

Shiao Fri, 14 Nov 2008 06:10:43 -0800

On Nov 14, 12:30 pm, "Mark Tolonen" <[EMAIL PROTECTED]> wrote:
> "Mark Tolonen" <[EMAIL PROTECTED]> wrote in message
>
> news:[EMAIL PROTECTED]
>
> > "Shiao" <[EMAIL PROTECTED]> wrote in message
> >news:[EMAIL PROTECTED]
> >> Hello,
> >> I'm trying to build a regex in python to identify punctuation
> >> characters in all the languages. Some regex implementations support an
> >> extended syntax \p{P} that does just that. As far as I know, python re
> >> doesn't. Any idea of a possible alternative?
>
> >> Apart from manually including the punctuation character range for each
> >> and every language, I don't see how this can be done.
>
> >> Thanks in advance for any suggestions.
>
> >> John
>
> > You can always build your own pattern.  Something like (Python 3.0rc2):
>
> > >>> import unicodedata
> > >>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
> > >>> import re
> > >>> r=re.compile('['+Po+']')
> > >>> x='我是美國人。'
> > >>> x
> > '我是美國人。'
> > >>> r.findall(x)
> > ['。']
>
> > -Mark
>
> This was an interesting problem.  Both \ and ] need to be escaped inside the
> character class to find all the punctuation correctly.  It turns out those
> two characters are adjacent in the Unicode character set, so in my first
> attempt the \ coincidentally escaped the ] that follows it.
>
> IDLE 3.0rc2
> >>> import re
> >>> import unicodedata as u
> >>> A=''.join(chr(i) for i in range(65536))
> >>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
> >>> len(A)
> 65536
> >>> len(P)
> 491
> >>> len(re.findall('['+P+']',A))                  # ] was naturally escaped
> 490
> >>> set(P)-set(re.findall('['+P+']',A))           # so only missing \
> {'\\'}
> >>> P=P.replace('\\','\\\\').replace(']','\\]')   # escape both of them.
> >>> len(re.findall('['+P+']',A))
> 491
>
> -Mark


Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
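
For anyone finding this thread later, here is a minimal sketch of the same
idea that does not depend on \ and ] happening to sit next to each other. It
runs every character through re.escape and scans up to sys.maxunicode rather
than stopping at 65536. The names build_punctuation_pattern and max_codepoint
are only for illustration; they are not from Mark's session.

```python
import re
import sys
import unicodedata


def build_punctuation_pattern(max_codepoint=sys.maxunicode):
    """Compile a character class matching every code point whose Unicode
    general category starts with 'P' (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    punct = [
        chr(i)
        for i in range(max_codepoint + 1)
        if unicodedata.category(chr(i)).startswith('P')
    ]
    # re.escape protects \, ], - and any other character that would be
    # special inside [...], so nothing relies on code point ordering.
    return re.compile('[' + ''.join(re.escape(c) for c in punct) + ']')


PUNCT_RE = build_punctuation_pattern()
print(PUNCT_RE.findall('我是美國人。'))   # ['。']
```

Building the class takes a moment because it walks the whole code point
range, so it is probably best compiled once and reused.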
--
http://mail.python.org/mailman/listinfo/python-list
