[issue1693050] \w not helpful for non-Roman scripts

Terry J. Reedy Fri, 28 Nov 2008 13:15:02 -0800

Terry J. Reedy <[EMAIL PROTECTED]> added the comment:

Vowel 'marks' are condensed vowel characters; they are very much part of
words and do not separate them.  Python 3 properly includes Mn and Mc as
identifier characters.


http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords
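
A quick way to see the identifier rule in action is str.isidentifier(),
which exists in Python 3.  This is only a small sketch; the variable name
`word` is chosen here for illustration, and the word is spelled with
escapes so that the code points are explicit.

```python
# The word from this report: HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# str.isidentifier() applies the Python 3 identifier rules to the string.
print(word.isidentifier())

# A vowel sign (Mc) may continue an identifier but cannot start one.
print("\u093f".isidentifier())
```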

For instance, the word 'hindi' has 3 consonants 'h', 'n', 'd', 2 vowels
'i' and 'ii' (long i) following 'h' and 'd', and a null vowel (virama)
after 'n'. [The null vowel is needed because the absence of a vowel mark
indicates the default vowel, short a.  So without it, the word would be
hinadii.]  The difference between the Devanagari vowel characters, used
at the beginning of words, and the vowel marks, used thereafter, is
purely graphical and not phonological.  In short, in the Sanskrit family,
word = syllable+
syllable = vowel | consonant + vowel mark
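
The grammar above can be approximated with the Unicode general
categories: treat each letter as the start of a syllable and attach any
following Mn or Mc character to it.  This is a rough sketch, not a full
Devanagari segmenter; the helper name split_syllables is made up for
illustration, and the virama is simply treated as the null vowel mark
described above.

```python
import unicodedata

# Categories of the combining vowel marks (and the virama).
MARKS = {"Mn", "Mc"}

def split_syllables(word):
    """Group each base character with the combining marks that follow it."""
    syllables = []
    for ch in word:
        if unicodedata.category(ch) in MARKS and syllables:
            # A mark belongs to the syllable opened by the preceding letter.
            syllables[-1] += ch
        else:
            # Any other character opens a new syllable.
            syllables.append(ch)
    return syllables

hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(split_syllables(hindi))
```

For 'hindi' this yields three groups: hi, n plus virama, and dii.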

From a clp post asking why re does not see hindi as a word:

हिन्दी
     ह DEVANAGARI LETTER HA (Lo)
     ि DEVANAGARI VOWEL SIGN I (Mc)
     न DEVANAGARI LETTER NA (Lo)
     ् DEVANAGARI SIGN VIRAMA (Mn)
     द DEVANAGARI LETTER DA (Lo)
     ी DEVANAGARI VOWEL SIGN II (Mc)
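
A listing like the one above can be reproduced with the standard
unicodedata module.  A minimal sketch; the names `word` and `categories`
are chosen here for illustration.

```python
import unicodedata

word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Print each character with its Unicode name and general category.
for ch in word:
    print(f"{ch} {unicodedata.name(ch)} ({unicodedata.category(ch)})")

# The categories alone show where the marks (Mc, Mn) sit among the letters (Lo).
categories = [unicodedata.category(ch) for ch in word]
print(categories)
```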

.isalpha and possibly other unicode methods need fixing also
>>> 'हिन्दी'.isalpha()  # 2.x and 3.0
False
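
Until the built-in methods treat marks as part of words, a mark-aware
check can be written by hand.  The function below is a hypothetical
helper, not a proposed patch: it accepts any letter category (L*) plus
Mn and Mc, and requires at least one letter.

```python
import unicodedata

def isalpha_with_marks(text):
    """Return True if text is made only of letters and combining marks."""
    has_letter = False
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            has_letter = True
        elif category not in ("Mn", "Mc"):
            # Digits, spaces, punctuation and so on are rejected.
            return False
    # An empty string or a string of bare marks is not a word.
    return has_letter

hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(isalpha_with_marks(hindi))
```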

----------
nosy: +tjreedy
versions: +Python 3.1

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1693050>
_______________________________________