Regex for unicode letter characters

schickb Sat, 10 Jan 2009 18:01:11 -0800

I need a regex that will match strings containing only Unicode letter
characters (no digits and no _ character). I was surprised to find that
the 're' module does not already include a special character class for
this (Python 2.6). Or did I miss something?

It seems like this would be a very common need. Is the following, based
on an old post by Martin v. Löwis, the only way to generate the
character class?

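import sys
import unicodedata

def letters():
    """Return a regex character class matching every Unicode letter."""
    start = end = None
    result = []
    for index in xrange(sys.maxunicode + 1):
        c = unichr(index)
        if unicodedata.category(c)[0] == 'L':
            # Letter: open a new run or extend the current one.
            if start is None:
                start = end = c
            else:
                end = c
        elif start is not None:
            # First non-letter after a run: emit it as "x" or "x-y".
            if start == end:
                result.append(start)
            else:
                result.append(start + u"-" + end)
            start = None
    # No run is still open here: the highest code point (U+FFFF or
    # U+10FFFF) is a noncharacter, never a letter.
    return u'[' + u''.join(result) + u']'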

Seems rather cumbersome.
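
For context, this is roughly how I would expect to use such a class to
test whole strings. It is only a sketch under my own assumptions, not
Martin's code: the names letter_class, LETTERS_ONLY and is_all_letters
are made up for illustration, and the try/except shim is there only so
the same code also runs on Python 3, which goes beyond the 2.6 setup
above.

```python
import re
import sys
import unicodedata

try:                        # Python 2 names, as used in the post
    unichr, xrange
except NameError:           # Python 3 fallback (assumption, not in the post)
    unichr, xrange = chr, range


def letter_class():
    """Build u'[...]' covering every code point whose category starts with 'L'."""
    ranges = []
    start = end = None
    for index in xrange(sys.maxunicode + 1):
        if unicodedata.category(unichr(index)).startswith('L'):
            if start is None:
                start = index
            end = index
        elif start is not None:
            ranges.append((start, end))
            start = None
    if start is not None:   # flush a run that reaches the last code point
        ranges.append((start, end))
    parts = [unichr(a) if a == b else unichr(a) + u'-' + unichr(b)
             for a, b in ranges]
    return u'[' + u''.join(parts) + u']'


# Hypothetical name. One or more letters, then \Z, which anchors at the
# true end of the string (unlike $, it does not allow a trailing newline).
LETTERS_ONLY = re.compile(letter_class() + u'+\\Z')


def is_all_letters(text):
    """Return True if text is non-empty and consists only of Unicode letters."""
    return LETTERS_ONLY.match(text) is not None
```

Building the class walks every code point once, so it makes sense to
compile the pattern a single time at import and reuse it.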

-Brad
