Terry Reedy wrote:
Martin v. Löwis wrote:

To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
annex C was added somewhere between revision 6 and 9, i.e. in early
2004. Python's current definition of \w is a straight-forward extension
of the historical \w definition (of Perl, I believe), which,
unfortunately, fails to recognize some of the Unicode subtleties.

I agree about not dumping on the past. When Unicode support was added to re, it was a somewhat experimental advance over bytes-only re. Now that Python has spread to south Asia as well as east Asia, it is time to advance it further. I think this is especially important for 3.0, which will attract such users with the option of native identifiers. The re module should be able to recognize Python identifiers as words. I care not whether the patch is called a fix or an update.

I have no personal need for this at the moment but it just happens that I studied Sanskrit a bit some years ago and understand the script and could explain why at least some 'marks' are really 'letters'. There are several other south Asian scripts descended from Devanagari, and included in Unicode, that use the same or similar vowel mark system. So updating Python's idea of a Unicode word will help users of several languages and make it more of a world language.
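To make the point concrete, here is a small illustration (my own, not from the original posts) of how Unicode classifies a Devanagari syllable. The syllable "ni" is two codepoints: a consonant that is a letter (category Lo) and a dependent vowel sign that is a combining mark (category Mc) - yet both are clearly part of the word.

```python
import unicodedata

# The Devanagari syllable "ni" is written as two codepoints: the consonant
# NA (a letter, category Lo) followed by the dependent vowel sign I
# (a spacing combining mark, category Mc).
ni = "\u0928\u093f"  # नि

for ch in ni:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicodedata.category(ch)}")
# U+0928 DEVANAGARI LETTER NA -> Lo
# U+093F DEVANAGARI VOWEL SIGN I -> Mc
```

A \w definition that covers only the L* categories stops at the Mc mark, splitting the syllable in half.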

I presume that not viewing combining marks as part of words would affect Hebrew and Arabic also.

I wonder if the current rule also affects European words with accents written as separate combining marks instead of as part of precomposed characters. For instance, if Martin's last name is written 'L' 'o' 'combining diaeresis' 'w' 'i' 's' (6 codepoints) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5 codepoints), is it still recognized as a word? (I don't know how to type the input to run the test.)
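The experiment is easy to set up with escape sequences rather than direct input; this sketch (mine, not from the thread) builds both spellings. The composed form matches \w+ on any Python 3, but whether the decomposed form matches depends on whether your interpreter's \w covers category Mn, which has varied across versions, so it is printed rather than assumed.

```python
import re
import unicodedata

composed = "L\u00f6wis"     # 'ö' as a single precomposed codepoint, 5 total
decomposed = "Lo\u0308wis"  # 'o' + COMBINING DIAERESIS, 6 codepoints

print(unicodedata.category("\u00f6"))  # Ll -- a lowercase letter
print(unicodedata.category("\u0308"))  # Mn -- a non-spacing combining mark

print(bool(re.fullmatch(r"\w+", composed)))    # True
# Version-dependent: True only if \w includes combining marks.
print(bool(re.fullmatch(r"\w+", decomposed)))
```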

I notice from the manual "All identifiers are converted into the normal form NFC while parsing; comparison of identifiers is based on NFC." If NFC uses the precomposed accented letters, then the issue is finessed away for European words simply because Unicode includes precomposed characters for European scripts but not for south Asian scripts.

Does that mean that the re module will need to convert both the pattern and the text to be searched into NFC form first? And I'm still not clear whether \w, when used on a string consisting of Lo followed by Mc, should match Lo and then Mc (one codepoint at a time) or together (one character at a time, where a character consists of some base character codepoint possibly followed by modifier codepoints).
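One could normalize by hand today, of course; this sketch (my own illustration) shows that NFC rescues the European case but, as noted above, cannot help Devanagari, because Unicode defines no precomposed codepoint for a consonant plus a vowel sign.

```python
import re
import unicodedata

decomposed = "Lo\u0308wis"                      # 6 codepoints
nfc = unicodedata.normalize("NFC", decomposed)  # 'Löwis', 5 codepoints

print(len(decomposed), len(nfc))        # 6 5
print(bool(re.fullmatch(r"\w+", nfc)))  # True: the combining mark composed away

# NFC cannot help here: NA + VOWEL SIGN I has no precomposed form,
# so normalization leaves the Mc mark in place.
ni = "\u0928\u093f"
print(unicodedata.normalize("NFC", ni) == ni)  # True: unchanged
```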

I ask because I'm working on the re module at the moment.
--
http://mail.python.org/mailman/listinfo/python-list