Jeffrey C. Jacobs added the comment: Thanks Matthew, and sorry to put you through more work; I just wanted to check exactly which Unicode code points (in UTF-16, I take it) were being used, so I could verify whether the Unicode standard expects them to be treated as separate words or as single letters within a word. Sanskrit is written with an alphabetic script, not an ideographic one, so each symbol is considered a letter. So I believe your implementation is correct and yes, you are right, re is at fault. The sequence contains only letters and accent-like combining characters, so it should be interpreted as a single word of 6 letters, as you determined, and not as a word made up of the first letter alone. Mind you, I misinterpreted msg190100: I thought you were using findall, in which case the answer should be 1 (one match), but as far as the length of the extracted text goes, yes, 6, I totally agree. Sorry for the misunderstanding. http://www.unicode.org/charts/PDF/U0900.pdf contains the code chart for the Devanagari block, which is used for Hindi.
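To make the "6 letters versus 1" point concrete, here is a minimal sketch. The exact test string is not quoted in this message, so the six-code-point Devanagari word below is an assumption chosen for illustration, and `is_word_char` and `leading_word` are hypothetical helpers, not part of re or regex. The idea is simply that combining marks (general categories starting with M) count as part of the word alongside letters.

```python
import re
import unicodedata

# Assumed sample: a 6-code-point Devanagari word (letters plus combining marks).
# The string actually used in the thread is not shown here.
SAMPLE = "\u0939\u093f\u0928\u094d\u0926\u0940"

def is_word_char(ch):
    """Hypothetical rule: letters (L*), marks (M*), decimal digits (Nd) and '_'."""
    cat = unicodedata.category(ch)
    return ch == "_" or cat[0] in "LM" or cat == "Nd"

def leading_word(text):
    """Return the longest prefix of text made up only of word characters."""
    end = 0
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[:end]

# General category of each code point in the sample.
categories = [unicodedata.category(ch) for ch in SAMPLE]
print(categories)

# Length of the word when combining marks are kept with their letters.
print(len(leading_word(SAMPLE)))

# For comparison, what the standard re module extracts with \w+;
# the result may depend on the Python version in use.
print(len(re.match(r"\w+", SAMPLE).group()))
```

If the last number printed is smaller than the second, that is the behaviour being discussed: `\w` stops at the first combining mark instead of treating the whole sequence as one word.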
----------
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1693050>