On Nov 30, 4:33 am, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Martin v. Löwis wrote:
> > To be fair to Python (and SRE),

I was being unfair? In the context, "bug" == "needs to be changed"; see below.

> > SRE predates TR#18 (IIRC) - at least
> > annex C was added somewhere between revision 6 and 9, i.e. in early
> > 2004. Python's current definition of \w is a straight-forward extension
> > of the historical \w definition (of Perl, I believe), which,
> > unfortunately, fails to recognize some of the Unicode subtleties.
>
> I agree about not dumping on the past.

Dumping on the past?? I used the term "bug" in the same sense as you did: "I suggest that OP (original poster) Shiao file a bug report at http://bugs.python.org".

> When unicode support was added
> to re, it was a somewhat experimental advance over bytes-only re. Now
> that Python has spread to south Asia as well as east Asia, it is time to
> advance it further. I think this is especially important for 3.0, which
> will attract such users with the option of native identifiers. Re
> should be able to recognize Python identifiers as words. I care not
> whether the patch is called a fix or an update.
>
> I have no personal need for this at the moment but it just happens that
> I studied Sanskrit a bit some years ago and understand the script and
> could explain why at least some 'marks' are really 'letters'. There are
> several other south Asian scripts descended from Devanagari, and
> included in Unicode, that use the same or similar vowel mark system. So
> updating Python's idea of a Unicode word will help users of several
> languages and make it more of a world language.
>
> I presume that not viewing letter marks as part of words would affect
> Hebrew and Arabic also.
>
> I wonder if the current rule also affects European words with accents
> written as separate marks instead of as part of combined characters.
> For instance, if Martin's last name is written 'L' 'o' 'diaeresis mark'
> 'w' 'i' 's' (6 chars) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5
> chars), is it still recognized as a word?
> (I don't know how to do the
> input to do the test.)

Like this:

| >>> import re
| >>> import unicodedata as ucd
| >>> w1 = u"L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
| >>> w2 = u"Lo\N{COMBINING DIAERESIS}wis"
| >>> w1
| u'L\xf6wis'
| >>> w2
| u'Lo\u0308wis'
| >>> ucd.category(u'\u0308')
| 'Mn'
| >>> u'\u0308'.isalpha()
| False
| >>> regex = re.compile(ur'\w+', re.UNICODE)
| >>> regex.match(w1).group(0)
| u'L\xf6wis'
| >>> regex.match(w2).group(0)
| u'Lo'
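
For anyone who wants to repeat the experiment on Python 3 rather than 2.x, here is a minimal sketch. It assumes Python 3 syntax (plain str literals, no ur'' prefix). The helper names first_word, first_word_nfc, is_word_char and first_word_with_marks are hypothetical, made up for this illustration, and the "letters, marks, decimal digits, underscore" rule is only one possible reading of "marks are really letters", not what re or TR#18 actually specify. NFC normalization is a workaround I am adding here; it only helps where a precomposed character exists, which is not the case for the Devanagari vowel signs.

```python
import re
import unicodedata

WORD_RE = re.compile(r"\w+")  # str patterns are Unicode-aware by default in Python 3


def first_word(text):
    """Return the leading run of word characters as re sees them, or ''."""
    m = WORD_RE.match(text)
    return m.group(0) if m else ""


def first_word_nfc(text):
    """Compose base letter + combining mark pairs (NFC) before matching."""
    return first_word(unicodedata.normalize("NFC", text))


def is_word_char(ch):
    """Hypothetical rule: letters (L*), marks (M*), decimal digits (Nd) and '_'."""
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat == "Nd" or ch == "_"


def first_word_with_marks(text):
    """Return the leading run of characters accepted by is_word_char."""
    end = 0
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[:end]


decomposed = "Lo\N{COMBINING DIAERESIS}wis"
# Devanagari letters interleaved with vowel signs (Mc) and a virama (Mn)
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

print(ascii(first_word(decomposed)))      # what \w+ matches on the decomposed form
print(ascii(first_word_nfc(decomposed)))  # same text after NFC composition
print(first_word_with_marks(hindi) == hindi)
```

As far as I can tell, the first call still stops after 'Lo', the NFC variant returns the whole name, and the category-based scan keeps the Devanagari word in one piece. The scan is deliberately naive: it would also accept a "word" that starts with a combining mark, so treat it as a starting point for discussion rather than a proposed patch.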