Tom Christiansen <tchr...@perl.com> added the comment:

Martin v. Löwis <rep...@bugs.python.org> wrote
   on Sat, 01 Oct 2011 10:59:48 -0000:

>> * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

> Where did you get that definition from? UTS#18 defines
> "<word_character>", which is Alphabetic + U+200C + U+200D
> (i.e. not including marks, but including those

From UTS#18 RL1.2a in Annex C, where a \p{word} or \w character
is defined to be

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}

>> I think what you are looking for here are Word characters without
>> Nd + Pc, so just Alphabetic + Mn+Mc+Me.
>>
>> Is that right?
>
> With your definition of "Word character" above, yes, that's right.

It's not mine.  It's tr18's.

> Marks won't start a word, though.

That's the smarter boundary thing they talk about.  I'm not myself
familiar with \pM

> As for terminology: I think the documentation should continue to
> speak about "words" and "letters", and then define what is meant
> in this context. It's not that the Unicode consortium invented
> the term "letter", so we should use it more liberally than just
> referring to the L* categories.

I really don't think it wise to have private definitions of these.

If Letter doesn't mean L?, things get too weird.  That's why
there are separate definitions of alphabetic, word, etc.

--tom

----------
title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12737>
_______________________________________
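
For illustration only, here is a rough Python sketch of the two character
sets discussed in this message: the tr18 word character (\p{alpha} +
\p{gc=Mark} + \p{digit} + \p{gc=Connector_Punctuation}) and the narrower
Alphabetic + Mn+Mc+Me set. The function names are hypothetical. Because
the unicodedata module does not expose the Alphabetic property, the sketch
assumes that general categories L* and Nl are a good enough stand-in for
\p{alpha}; that is an approximation, not the real property.

```python
import unicodedata


def _is_alpha_approx(ch):
    # ASSUMPTION: L* and Nl stand in for \p{alpha}. The real Alphabetic
    # property also includes Other_Alphabetic code points, which the
    # unicodedata module cannot report.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"


def _is_mark(ch):
    # \p{gc=Mark}: Mn, Mc and Me.
    return unicodedata.category(ch).startswith("M")


def is_word_char(ch):
    # Approximates the tr18 word character:
    # \p{alpha} + \p{gc=Mark} + \p{digit} + \p{gc=Connector_Punctuation}
    cat = unicodedata.category(ch)
    return _is_alpha_approx(ch) or _is_mark(ch) or cat in ("Nd", "Pc")


def is_alpha_or_mark(ch):
    # The narrower set from the thread: word characters without Nd + Pc.
    return _is_alpha_approx(ch) or _is_mark(ch)
```

With these definitions, U+0301 COMBINING ACUTE ACCENT falls in both sets,
while "5" (Nd) and "_" (Pc) fall only in the first.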