Terry Reedy wrote:
Jerry Hill wrote:
On Fri, Nov 28, 2008 at 10:47 AM, Shiao <[EMAIL PROTECTED]> wrote:
The regex below identifies words in all languages I tested, but not in
Hindi:

# -*- coding: utf-8 -*-

import re
pat = re.compile('^(\w+)$', re.U)
langs = ('English', '中文', 'हिन्दी')

I think the problem is that the Hindi text contains both alphanumeric
and non-alphanumeric characters.  I'm not very familiar with Hindi,
much less how it's represented in Unicode, but take a look at the output of
this code:

# -*- coding: utf-8 -*-
import unicodedata as ucd

langs = (u'English', u'中文', u'हिन्दी')
for lang in langs:
    print lang
    for char in lang:
        print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))

Output:

English
     E LATIN CAPITAL LETTER E (Lu)
     n LATIN SMALL LETTER N (Ll)
     g LATIN SMALL LETTER G (Ll)
     l LATIN SMALL LETTER L (Ll)
     i LATIN SMALL LETTER I (Ll)
     s LATIN SMALL LETTER S (Ll)
     h LATIN SMALL LETTER H (Ll)
中文
     中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
     文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
हिन्दी
     ह DEVANAGARI LETTER HA (Lo)
     ि DEVANAGARI VOWEL SIGN I (Mc)
     न DEVANAGARI LETTER NA (Lo)
     ् DEVANAGARI SIGN VIRAMA (Mn)
     द DEVANAGARI LETTER DA (Lo)
     ी DEVANAGARI VOWEL SIGN II (Mc)

From that, we see that there are some characters in the Hindi string
that aren't letters (they're not in Unicode category L), but are
instead marks (Unicode category M).
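
As a rough workaround sketch (Python 3 syntax, not something proposed in this thread), one could test words by Unicode category instead of relying on \w. The helper name is_word and its rule (letters, marks, numbers and underscore count as word characters) are assumptions made for illustration only:

```python
import unicodedata


def is_word(text):
    """Return True if text is non-empty and every character is a
    letter (L*), mark (M*), number (N*) or an underscore.

    Hypothetical helper: the rule is an assumption, not re's definition.
    """
    if not text:
        return False
    for char in text:
        # The first letter of the category code is the major class.
        major = unicodedata.category(char)[0]
        if major not in ('L', 'M', 'N') and char != '_':
            return False
    return True


for lang in ('English', '中文', 'हिन्दी'):
    print(lang, is_word(lang))
```

Accepting the whole M class is deliberately loose; a stricter rule might, for example, reject a word that starts with a combining mark.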

Python 3.0 allows Unicode identifiers. Mn and Mc characters are included in the set of allowed alphanumeric characters. 'Hindi' is a word in both its native characters and in Latin transliteration.

http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords
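
The identifier rule can be checked from Python 3 itself with str.isidentifier(). This is only a small sketch; the variable names are made up for the example:

```python
# Python 3: ask the language whether each string is a valid identifier.
hindi = 'हिन्दी'
hindi_is_identifier = hindi.isidentifier()

# Counterexample: an identifier may not start with a digit.
digit_first_is_identifier = '1abc'.isidentifier()

print(hindi_is_identifier, digit_first_is_identifier)
```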

re is too restrictive in its definition of 'word'. I suggest that OP (original poster) Shiao file a bug report at http://bugs.python.org

Should the Mc and Mn codepoints match \w in the re module even though u'हिन्दी'.isalpha() returns False (in Python 2.x, haven't tried Python 3.x)? Issue 1693050 said no. Perhaps someone with knowledge of Hindi could suggest how Python should handle it. I wouldn't want the re module to say one thing and the rest of the language to say another! :-)
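
For anyone who wants to try the isalpha() question on Python 3, a minimal sketch follows. It reports which characters of the word fail isalpha(), using only str.isalpha() and the unicodedata module; the variable names are illustrative:

```python
import unicodedata

word = 'हिन्दी'

# isalpha() on the whole string is True only if every character is a letter.
whole_is_alpha = word.isalpha()

# Collect the characters that are not letters, with their names and categories.
non_alpha = [
    (unicodedata.name(char), unicodedata.category(char))
    for char in word
    if not char.isalpha()
]

print(whole_is_alpha)
for name, category in non_alpha:
    print(name, category)
```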