Re: Identifying unicode punctuation characters with Python regex

Mark Tolonen Fri, 14 Nov 2008 03:35:38 -0800

"Mark Tolonen" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

"Shiao" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

Hello,
I'm trying to build a regex in python to identify punctuation
characters in all the languages. Some regex implementations support an
extended syntax \p{P} that does just that. As far as I know, python re
doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John


You can always build your own pattern.  Something like (Python 3.0rc2):

import unicodedata

Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) =='Po')

import re
r=re.compile('['+Po+']')
x='我是美國人。'
x

'我是美國人。'

r.findall(x)

['。']

-Mark
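A note on the category test: 'Po' ("other punctuation") is only one of the Unicode punctuation categories, so a pattern built from it alone will not match brackets, quotes, or dashes. The short sketch below just prints the category of a few sample characters (the sample string is my own choice, not from the original post):

```python
import unicodedata

# A few sample characters and their Unicode general categories.
# Every punctuation category starts with 'P' (Po, Ps, Pe, Pd, Pc, Pi, Pf).
cats = {ch: unicodedata.category(ch) for ch in '。「」-'}

for ch, cat in cats.items():
    print(repr(ch), cat)
# '。' Po
# '「' Ps
# '」' Pe
# '-' Pd
```

Testing `category(c)[0] == 'P'` instead of `== 'Po'` picks up all of these.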

This was an interesting problem. Need to escape \ and ] to find all the
punctuation correctly, and it turns out those characters are sequential in
the Unicode character set, so ] was coincidentally escaped in my first
attempt.
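To make the coincidence concrete, here is a small sketch. The five-character sample string is hypothetical and chosen only because it contains the adjacent code points [, \ and ]; it is not the full punctuation set.

```python
import re

# '[', '\' and ']' sit at consecutive code points U+005B..U+005D.
codes = [hex(ord(c)) for c in '[\\]']
print(codes)                      # ['0x5b', '0x5c', '0x5d']

sample = '@[\\]_'                 # the five characters @ [ \ ] _

# Unescaped: the pattern text is [@[\]_], so the backslash escapes the ]
# that follows it and is never matched as a character itself.
naive = re.compile('[' + sample + ']')
print(naive.findall(sample))      # ['@', '[', ']', '_']

# Escaped with re.escape(): all five characters are matched.
escaped = re.compile('[' + re.escape(sample) + ']')
print(escaped.findall(sample))    # ['@', '[', '\\', ']', '_']
```

So the unescaped class loses only the backslash, and ] survives because the backslash in front of it happens to escape it.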


IDLE 3.0rc2

import unicodedata as u
import re
A=''.join(chr(i) for i in range(65536))
P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
len(A)

len(P)

len(re.findall('['+P+']',A)) # ] was naturally escaped

set(P)-set(re.findall('['+P+']',A))         # so only missing \

{'\\'}

P=P.replace('\\','\\\\').replace(']','\\]')   # escape both of them.
len(re.findall('['+P+']',A))

491

-Mark
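For anyone wanting to reuse this, a minimal sketch of the same idea as a function. It swaps the two manual replace() calls for re.escape() and, as an assumption beyond the session above, scans up to sys.maxunicode rather than stopping at 65536. The names punctuation_pattern and PUNCT_RE are made up for illustration.

```python
import re
import sys
import unicodedata

def punctuation_pattern(limit=sys.maxunicode + 1):
    """Compile a character class matching every code point below `limit`
    whose Unicode general category starts with 'P'."""
    punct = ''.join(
        chr(i) for i in range(limit)
        if unicodedata.category(chr(i)).startswith('P')
    )
    # re.escape() takes care of \ and ] (and - and [) inside the class.
    return re.compile('[' + re.escape(punct) + ']')

PUNCT_RE = punctuation_pattern()

print(PUNCT_RE.findall('我是美國人。'))    # ['。']
print(PUNCT_RE.findall('a\\b]c'))         # ['\\', ']']
```

The number of characters found depends on the Unicode database shipped with the Python in use, so it need not equal the 491 shown above.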

--
http://mail.python.org/mailman/listinfo/python-list
