Re: Unicode regex and Hindi language

Jerry Hill Fri, 28 Nov 2008 08:37:06 -0800

On Fri, Nov 28, 2008 at 10:47 AM, Shiao <[EMAIL PROTECTED]> wrote:
> The regex below identifies words in all languages I tested, but not in
> Hindi:
>
> # -*- coding: utf-8 -*-
>
> import re
> pat = re.compile('^(\w+)$', re.U)
> langs = ('English', '中文', 'हिन्दी')


I think the problem is that the Hindi text contains both alphanumeric
and non-alphanumeric characters.  I'm not very familiar with Hindi,
much less how it's represented in Unicode, but take a look at the
output of this code:

# -*- coding: utf-8 -*-
import unicodedata as ucd

langs = (u'English', u'中文', u'हिन्दी')
for lang in langs:
    print lang
    # Show each code point with its Unicode name and general category
    for char in lang:
        print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))

Output:

English
         E LATIN CAPITAL LETTER E (Lu)
         n LATIN SMALL LETTER N (Ll)
         g LATIN SMALL LETTER G (Ll)
         l LATIN SMALL LETTER L (Ll)
         i LATIN SMALL LETTER I (Ll)
         s LATIN SMALL LETTER S (Ll)
         h LATIN SMALL LETTER H (Ll)
中文
         中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
         文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
हिन्दी
         ह DEVANAGARI LETTER HA (Lo)
         ि DEVANAGARI VOWEL SIGN I (Mc)
         न DEVANAGARI LETTER NA (Lo)
         ् DEVANAGARI SIGN VIRAMA (Mn)
         द DEVANAGARI LETTER DA (Lo)
         ी DEVANAGARI VOWEL SIGN II (Mc)

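From that, we see that some characters in the Hindi string aren't
letters (they're not in Unicode category L) but marks (Unicode
category M).  With re.U, \w only matches characters that the Unicode
database classifies as alphanumeric, plus the underscore, so the vowel
signs and the virama don't match and '^(\w+)$' rejects the whole word.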
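
One possible workaround, offered only as a rough sketch and not tested
against real Hindi text beyond the sample above, is to skip \w and
check the general category of each character directly, accepting
marks (M) alongside letters (L) and numbers (N).  The helper name
is_word is made up for this example, and treating every L, M and N
category as a "word character" is my assumption about what you want:

```python
# -*- coding: utf-8 -*-
import unicodedata as ucd


def is_word(text):
    """Return True if text is non-empty and consists only of letters,
    combining marks, numbers, or underscores.

    Hypothetical helper: it accepts any character whose Unicode general
    category starts with L, M, or N, which is broader than what \\w
    accepts with re.U.
    """
    if not text:
        return False
    for char in text:
        # category() returns a two-letter code such as 'Lo' or 'Mc';
        # the first letter is the major class.
        if char != u'_' and ucd.category(char)[0] not in 'LMN':
            return False
    return True


langs = (u'English', u'中文', u'हिन्दी')
results = [is_word(lang) for lang in langs]
```

With the three sample strings, every entry in results should be True,
while a string containing a space or punctuation should be rejected.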
--
http://mail.python.org/mailman/listinfo/python-list
