Am 11.12.2010 18:33, schrieb Perry Johnson: > Python's re module does not support POSIX character classes, for > example [:alpha:]. It is, of course, trivial to simulate them using > character ranges when the text to be matched uses the ASCII character > set. Sadly, my problem is that I need to process Unicode text. The re > module has its own character classes that do support Unicode, however > they are not sufficient. > > I would find it extremely useful if there was information on the > Unicode code points that map to each of the POSIX character classes.
By definition, this is not possible. The POSIX character classes are locale-dependent, whereas the recommendation for Unicode regular expressions is that they are not (i.e. a Unicode regex character class should refer to the same characters independent from the locale). If you want to construct locale-dependent Unicode character classes, you should use this procedure: - iterate over all byte values (0..255) - perform the relevant locale-specific tests - decode each byte into Unicode, using the locale's encoding - construct a character class out of that Unfortunately, that will work only for single-byte encodings. I'm not aware of a procedure that does that for multi-byte strings. But perhaps you didn't mean "POSIX character class" in this literal way. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list