Re: [Tutor] regex: matching unicode

Steven D'Aprano Sat, 22 Dec 2012 20:17:10 -0800

On 23/12/12 07:53, Albert-Jan Roskam wrote:

> Hi,
>
> Is the code below the only/shortest way to match unicode characters?



No. You could install a more Unicode-aware regex engine, and use it instead
of Python's re module, where Unicode support is at best only partial.

Try this one:

http://pypi.python.org/pypi/regex

and report any issues to the author.

> I would like to match whatever is defined as a character in the unicode
> reference database. So letters in the broadest sense of the word,


Well, not really, actually letters in the sense of the Unicode reference
database :-)

In the above regex module, I think you could write:

'\p{Alphabetic}'

or

'\p{L}'

but don't quote me on this.
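
For what it's worth, here is a minimal sketch of how that might look, written for Python 3 and assuming the third-party regex module from the link above is installed. The helper name find_letter_runs is made up for illustration, and the import is guarded so the snippet still loads without the module. As far as I know, \p{L} covers only the letter categories, while \p{Alphabetic} is a broader property that also takes in things like Roman numerals, so \p{L} is probably the one you want here.

```python
try:
    import regex  # third-party module: pip install regex
except ImportError:
    regex = None  # not installed; find_letter_runs() will raise


def find_letter_runs(text):
    """Return runs of characters whose general category is a letter (L*)."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' module is not installed")
    return regex.findall(r"\p{L}+", text)


if regex is not None:
    # Greek mu, superscript two, ASCII letters, underscore, digit, Roman numeral
    print(find_letter_runs("\u03bc\u00b2 abc_1 \u2162"))
```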

> but not digits, underscore or whitespace. Until just now, I was convinced
> that the re.UNICODE flag generalized the [a-z] class to all unicode letters,
> and that the absence of re.U was an implicit 're.ASCII'. Apparently that
> mental model was *wrong*.
> But [^\W\s\d_]+ is kind of hard to read/write.


Of course it is. It's a regex.

> import re
> s = unichr(956)  # mu sign
> m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)


Unfortunately that matches too much: in Python 2.7, it matches 340 non-letter
characters. Run this to see them:

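import re
import unicodedata
MAXNUM = 0x10000   # one more than maximum unichr in Python "narrow builds"
regex = re.compile(ur"[^\W\s\d_]+", re.I | re.U)
# General categories that count as letters in the Unicode database.
LETTERS = 'L|Ll|Lm|Lo|Lt|Lu'.split('|')
failures = []   # characters where the regex and the database disagree
kinds = set()   # the categories those characters belong to
for c in map(unichr, range(MAXNUM)):
    # Compare "the regex matches c" against "c is a letter".
    if bool(regex.match(c)) != (unicodedata.category(c) in LETTERS):
        failures.append(c)
        kinds.add(unicodedata.category(c))

print kinds, len(failures)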


The failures are all numbers with category Nl or No ("letterlike numeric
character" and "numeric character of other type"). You can see them with:

for c in failures:
    print c, unicodedata.category(c), unicodedata.name(c)



I won't show the full output, but a small sample includes:

² No SUPERSCRIPT TWO
¼ No VULGAR FRACTION ONE QUARTER
৴ No BENGALI CURRENCY NUMERATOR ONE
፹ No ETHIOPIC NUMBER EIGHTY
ᛮ Nl RUNIC ARLAUG SYMBOL
Ⅲ Nl ROMAN NUMERAL THREE


so you will probably have to post-process your matching results to exclude
these false-positives. Or just accept them.
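
One way the post-processing could go, as a rough sketch in Python 3 (where str patterns are Unicode-aware by default): let the regex find candidate runs, then split each run on anything the Unicode database does not class as a letter. The names is_letter and letter_runs are mine, not anything from the re module.

```python
import itertools
import re
import unicodedata

# Same character class as above; in Python 3 no re.U flag is needed for str.
CANDIDATES = re.compile(r"[^\W\s\d_]+")


def is_letter(ch):
    """True if the Unicode general category of ch is one of the L* categories."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Return the regex matches with non-letter characters (Nl, No) cut out."""
    runs = []
    for match in CANDIDATES.finditer(text):
        # Split each candidate into alternating letter / non-letter groups.
        for keep, group in itertools.groupby(match.group(), key=is_letter):
            if keep:
                runs.append("".join(group))
    return runs


# Superscript two and the Roman numeral are dropped; the letters survive.
print(letter_runs("x\u00b2 + \u2162abc"))
```

Whether a run like "x²" should be split into "x" or thrown away entirely depends on what you are matching for, so treat the splitting behaviour as one choice among several.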

If you have a "wide build", or Python 3.3, you can extend the test to the
full Unicode range of 0x110000. When I do that, I find 684 false matches,
all in category Nl and No.
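
For anyone on Python 3, here is a rough port of the test above that covers the full range. It is only a sketch: unichr becomes chr, print becomes a function, and find_mismatches is a name I made up. The count you get will depend on the version of the Unicode database your Python ships with, so don't expect it to agree exactly with the numbers quoted here.

```python
import re
import unicodedata

# str patterns are Unicode-aware by default in Python 3, so no re.U flag.
PATTERN = re.compile(r"[^\W\s\d_]")
LETTER_CATEGORIES = {"Ll", "Lm", "Lo", "Lt", "Lu"}


def find_mismatches(limit=0x110000):
    """Return (failures, kinds) for all code points below limit.

    failures lists the characters where the regex and the Unicode
    database disagree about "is this a letter"; kinds is the set of
    general categories those characters belong to.
    """
    failures = []
    kinds = set()
    for c in map(chr, range(limit)):
        category = unicodedata.category(c)
        if bool(PATTERN.match(c)) != (category in LETTER_CATEGORIES):
            failures.append(c)
            kinds.add(category)
    return failures, kinds


failures, kinds = find_mismatches()
print(sorted(kinds), len(failures))
```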



--
Steven
