On Fri, Nov 28, 2008 at 10:47 AM, Shiao <[EMAIL PROTECTED]> wrote:
> The regex below identifies words in all languages I tested, but not in
> Hindi:
>
> # -*- coding: utf-8 -*-
>
> import re
> pat = re.compile('^(\w+)$', re.U)
> langs = ('English', '中文', 'हिन्दी')

I think the problem is that the Hindi text contains both alphanumeric and
non-alphanumeric characters, and \w only matches the alphanumeric ones
(plus the underscore). I'm not very familiar with Hindi, much less with how
it's represented in Unicode, but take a look at the output of this code:

# -*- coding: utf-8 -*-
import unicodedata as ucd

langs = (u'English', u'中文', u'हिन्दी')
for lang in langs:
    print lang
    for char in lang:
        print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))

Output:

English
         E LATIN CAPITAL LETTER E (Lu)
         n LATIN SMALL LETTER N (Ll)
         g LATIN SMALL LETTER G (Ll)
         l LATIN SMALL LETTER L (Ll)
         i LATIN SMALL LETTER I (Ll)
         s LATIN SMALL LETTER S (Ll)
         h LATIN SMALL LETTER H (Ll)
中文
         中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
         文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
हिन्दी
         ह DEVANAGARI LETTER HA (Lo)
         ि DEVANAGARI VOWEL SIGN I (Mc)
         न DEVANAGARI LETTER NA (Lo)
         ् DEVANAGARI SIGN VIRAMA (Mn)
         द DEVANAGARI LETTER DA (Lo)
         ी DEVANAGARI VOWEL SIGN II (Mc)

From that, we see that some characters in the Hindi string aren't letters
(they're not in Unicode category L) but marks (Unicode category M), so
'^(\w+)$' cannot match the whole string.

-- 
http://mail.python.org/mailman/listinfo/python-list
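
Following on from the category listing, one possible workaround is to test the Unicode category of each character instead of relying on \w. The sketch below is only an illustration, not a drop-in replacement for the regex. It is written in Python 3 syntax, where str is already Unicode, and the helper name is_word is made up for this example. It treats letters (L*), marks (M*), decimal digits (Nd) and the underscore as word characters, which is only roughly what \w accepts plus the marks.

```python
import re
import unicodedata

# The pattern from the quoted message: with Unicode matching, \w accepts
# letters, digits and the underscore, but not combining marks (Mn/Mc).
WORD_RE = re.compile(r'^(\w+)$', re.UNICODE)


def is_word(text):
    """Hypothetical helper: True if text is non-empty and every character
    is a letter (L*), a mark (M*), a decimal digit (Nd) or an underscore."""
    if not text:
        return False
    for ch in text:
        cat = unicodedata.category(ch)
        if ch == '_' or cat[0] in ('L', 'M') or cat == 'Nd':
            continue
        return False
    return True


for lang in ('English', '中文', 'हिन्दी'):
    # Columns: the string, whether the regex matches, whether is_word accepts
    print(lang, bool(WORD_RE.match(lang)), is_word(lang))
```

Note that this check would also accept a string that begins with a combining mark, so it is looser than a real word definition. Whether that matters depends on the input.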