[issue1693050] \w not helpful for non-Roman scripts

Jeffrey C. Jacobs Wed, 29 May 2013 11:32:56 -0700

Jeffrey C. Jacobs added the comment:

Thanks, Matthew, and sorry to put you through more work.  I just wanted to check
exactly which Unicode code points (UTF-16 code units, I take it) were being
used, so I could verify whether the Unicode standard expects them to be treated
as separate words or as single letters within a word.  Sanskrit is written
with an alphabetic script, not ideographs, so each symbol is considered a
letter.  So I believe your implementation is correct and yes, you are right,
re is at fault.  That sequence contains only letters and combining (accent)
marks, so it should be interpreted as a single word of 6 characters, as you
determined, and not as a word consisting of just the first letter.  Mind you,
I misinterpreted msg190100: I thought you were using findall, in which case
the answer should be 1 (one match).  As far as the length of the extracted
match goes, yes, 6, I totally agree.  Sorry for the misunderstanding.
http://www.unicode.org/charts/PDF/U0900.pdf contains the Devanagari code
chart, which covers Hindi.
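
To make the letters-plus-marks point concrete, here is a rough sketch.  The
sample word is an assumption (the Devanagari word for "Hindi", six code
points), not necessarily the exact string from msg190100, and is_word_char
and leading_word_length are hypothetical helpers that only approximate a
Unicode-aware \w by general category.  The last line prints what the stdlib
re module extracts on your interpreter, for comparison.

```python
import re
import unicodedata

# Assumed sample word: Devanagari "Hindi", written with escapes so the
# six code points are explicit (3 letters, 3 combining marks).
SAMPLE = "\u0939\u093f\u0928\u094d\u0926\u0940"


def is_word_char(ch):
    """Hypothetical approximation of a Unicode-aware \\w.

    Accepts letters (L*), combining marks (M*), decimal digits (Nd)
    and connector punctuation such as the underscore (Pc).
    """
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat in ("Nd", "Pc")


def leading_word_length(text):
    """Count the code points in the word that starts at position 0."""
    n = 0
    for ch in text:
        if not is_word_char(ch):
            break
        n += 1
    return n


# General category of each code point in the sample.
categories = [unicodedata.category(ch) for ch in SAMPLE]
print(categories)

# Length of the leading word when marks count as word characters.
print(leading_word_length(SAMPLE))

# Length of what the stdlib re module extracts with \w+.
print(len(re.match(r"\w+", SAMPLE).group()))
```

Counting marks (categories Mn and Mc) as word characters is what lets the
whole sequence come out as one word; a matcher that accepts only
alphanumeric code points stops at the first combining mark.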


----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1693050>
_______________________________________
