>>Is the code below the only/shortest way to match unicode characters? I would 
>>like to match whatever is defined as a character in the unicode reference 
>>database. So letters in the broadest sense of the word, but not digits, 
>>underscore or whitespace. Until just now, I was convinced that the re.UNICODE 
>>flag generalized the [a-z] class to all unicode letters, and that the absence 
>>of re.U was an implicit 're.ASCII'. Apparently that mental model was *wrong*.

>>But [^\W\s\d_]+ is kind of hard to read/write.
>>
>>import re
>>s = unichr(956)  # mu sign
>>m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)
>>
>>
>A thought would be to rely on the general category of the character, as listed 
>in the Unicode database. unicodedata.category will give you what you need. 
>Here is a list of categories in the Unicode standard:
>
>
>http://www.fileformat.info/info/unicode/category/index.htm
>
>
>
>So, if you wanted only letters, you could say:
>
>
>import unicodedata
>
>def is_unicode_character(c):
>    assert len(c) == 1
>    return 'L' in unicodedata.category(c)


Hi everybody,

Thanks for your replies, they have been most insightful. For now the 
'unicodedata' approach works best for me. I need to validate a word, and this is 
now a two-step approach: first, check whether the first character is a (unicode) 
letter; second, do the other checks with an ordinary regex (i.e., no spaces, 
ampersands and whatnot). Using one regex would be more elegant, but I 
got kinda intimidated by the hail of additional flags in the regex module.
Having unicode versions of the classes \d, \w, etc. (let's call them \ud, \uw) 
would be cool.

Here is another useful way to use your (Hugo's) function. The 
Korean hangul sign and the degree sign look almost the same!

import unicodedata

hangul = unichr(4363)
degree = unichr(176)

def isUnicodeChar(c):
    # Decode byte strings first, so the length check counts characters, not bytes.
    c = c.decode("utf-8") if isinstance(c, str) else c
    assert len(c) == 1
    return 'L' in unicodedata.category(c)

>>> isUnicodeChar(hangul)
True
>>> isUnicodeChar(degree)
False
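
For comparison, the one-regex version of the two-step check might look roughly like the sketch below. It is written in Python 3 syntax (chr instead of unichr, and str patterns are Unicode-aware without re.U), and both the name is_valid_word and the "no whitespace, no ampersand" rule for the rest of the word are made up for illustration. Note that [^\W\d_] is only an approximation of "general category L": as far as I know, \w also accepts a few numeric characters that \d does not, so unicodedata remains the stricter test.

```python
import re

# First character: a word character that is neither a digit nor an underscore,
# which in practice means a letter. \s is not needed inside the class because
# \W already covers whitespace.
# Remaining characters: anything except whitespace and '&' (hypothetical rule).
WORD = re.compile(r"[^\W\d_][^\s&]*")

def is_valid_word(word):
    """Return True if the whole string matches the pattern above."""
    return WORD.fullmatch(word) is not None

print(is_valid_word(chr(956) + "m"))   # starts with the Greek letter mu
print(is_valid_word(chr(4363)))        # the hangul character from above
print(is_valid_word(chr(176) + "C"))   # starts with the degree sign
print(is_valid_word("a b"))            # contains a space
```

With the third-party regex module the first class could presumably be spelled \p{L} instead, but I have not tried that here.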
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
