Terry Reedy wrote:
Martin v. Löwis wrote:

To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
annex C was added somewhere between revision 6 and 9, i.e. in early
2004. Python's current definition of \w is a straight-forward extension
of the historical \w definition (of Perl, I believe), which,
unfortunately, fails to recognize some of the Unicode subtleties.

I agree about not dumping on the past. When Unicode support was added to re, it was a somewhat experimental advance over bytes-only re. Now that Python has spread to south Asia as well as east Asia, it is time to advance it further. I think this is especially important for 3.0, which will attract such users with the option of native identifiers. The re module should be able to recognize Python identifiers as words. I care not whether the patch is called a fix or an update.

I have no personal need for this at the moment but it just happens that I studied Sanskrit a bit some years ago and understand the script and could explain why at least some 'marks' are really 'letters'. There are several other south Asian scripts descended from Devanagari, and included in Unicode, that use the same or similar vowel mark system. So updating Python's idea of a Unicode word will help users of several languages and make it more of a world language.
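To make the point concrete, here is a small illustration (my own, not from the original posts) of how Unicode classifies a Devanagari syllable. The syllable "ni" is two codepoints: a consonant that is a letter (category Lo) and a dependent vowel sign that is a combining mark (category Mc) - yet both are clearly part of the word.

```python
import unicodedata

# The Devanagari syllable "ni" is written as two codepoints: the consonant
# NA (a letter, category Lo) followed by the dependent vowel sign I
# (a spacing combining mark, category Mc).
ni = "\u0928\u093f"  # नि

for ch in ni:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicodedata.category(ch)}")
# U+0928 DEVANAGARI LETTER NA -> Lo
# U+093F DEVANAGARI VOWEL SIGN I -> Mc
```

A \w definition that covers only the L* categories stops at the Mc mark, splitting the syllable in half.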

I presume that not viewing combining marks as part of words would affect Hebrew and Arabic also.

I wonder if the current rule also affects European words with accents written as separate combining marks instead of as part of precomposed characters. For instance, if Martin's last name is written 'L' 'o' 'combining diaeresis' 'w' 'i' 's' (6 codepoints) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5 codepoints), is it still recognized as a word? (I don't know how to type the input to run the test.)
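The experiment is easy to set up with escape sequences rather than direct input; this sketch (mine, not from the thread) builds both spellings. The composed form matches \w+ on any Python 3, but whether the decomposed form matches depends on whether your interpreter's \w covers category Mn, which has varied across versions, so it is printed rather than assumed.

```python
import re
import unicodedata

composed = "L\u00f6wis"     # 'ö' as a single precomposed codepoint, 5 total
decomposed = "Lo\u0308wis"  # 'o' + COMBINING DIAERESIS, 6 codepoints

print(unicodedata.category("\u00f6"))  # Ll -- a lowercase letter
print(unicodedata.category("\u0308"))  # Mn -- a non-spacing combining mark

print(bool(re.fullmatch(r"\w+", composed)))    # True
# Version-dependent: True only if \w includes combining marks.
print(bool(re.fullmatch(r"\w+", decomposed)))
```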

I notice from the manual "All identifiers are converted into the normal form NFC while parsing; comparison of identifiers is based on NFC." If NFC uses the precomposed accented letters, then the issue is finessed away for European words simply because Unicode includes precomposed characters for European scripts but not for south Asian scripts.

Does that mean that the re module will need to convert both the pattern and the text to be searched into NFC form first? And I'm still not clear whether \w, when used on a string consisting of Lo followed by Mc, should match Lo and then Mc (one codepoint at a time) or together (one character at a time, where a character consists of some base character codepoint possibly followed by modifier codepoints).
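One could normalize by hand today, of course; this sketch (my own illustration) shows that NFC rescues the European case but, as noted above, cannot help Devanagari, because Unicode defines no precomposed codepoint for a consonant plus a vowel sign.

```python
import re
import unicodedata

decomposed = "Lo\u0308wis"                      # 6 codepoints
nfc = unicodedata.normalize("NFC", decomposed)  # 'Löwis', 5 codepoints

print(len(decomposed), len(nfc))        # 6 5
print(bool(re.fullmatch(r"\w+", nfc)))  # True: the combining mark composed away

# NFC cannot help here: NA + VOWEL SIGN I has no precomposed form,
# so normalization leaves the Mc mark in place.
ni = "\u0928\u093f"
print(unicodedata.normalize("NFC", ni) == ni)  # True: unchanged
```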

I ask because I'm working on the re module at the moment.
--
http://mail.python.org/mailman/listinfo/python-list