On Wed, 19 Jul 2017 12:09 am, Random832 wrote:

> On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote:
>> What do you mean about regular expressions? You can use REs with
>> normalized strings. And if you have any valid definition of "real
>> character", you can use it equally on an NFC-normalized or
>> NFD-normalized string than any other. They're just strings, you know.
>
> I don't understand how normalization is supposed to help with this. It's
> not like there aren't valid combinations that do not have a
> corresponding single NFC codepoint (to say nothing of the situation with
> e.g. Indic languages).

Normalisation helps. Suppose you want to search for é, for example. A naive
regular expression engine will only find the exact representation you or
your editor happened to use:

    U+00E9 LATIN SMALL LETTER E WITH ACUTE

or

    U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT

but not both. By normalising, you ensure that both the text you are
searching and the regex you are searching for are in the same state: either
composed to a single code point U+00E9, or decomposed to the two code points
U+0065 U+0301, but never one in one state and the other in the other.

For characters that have no canonical composed form, there's no problem:
you will always be searching for a decomposed character, using a base
character followed by combining characters, so there is no discrepancy and
it will just work.

> In principle probably a viable solution for regex would be to add
> character classes for base and combining characters, and then
> "[[:base:]][[:combining:]]*" can be used as a building block if
> necessary.

I don't know what that means. Any code point (except for combining
characters themselves) can be used as the base, and the various kinds of
combining characters have the Unicode general category property:

    Mn (Mark, nonspacing)
    Mc (Mark, spacing combining)
    Me (Mark, enclosing)

If we're talking about combining accents and diacritics, the one we want is
Mn: U+0301 COMBINING ACUTE ACCENT, for instance, is a nonspacing mark. But
generally, we're not after "any old diacritic", we're after a specific one,
on a specific base.


--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

--
https://mail.python.org/mailman/listinfo/python-list
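
P.S. To make the normalisation point concrete, here is a minimal sketch in
Python using only the standard re and unicodedata modules. The helper name
normalised_search is my own invention for illustration, not part of either
module, and it assumes the pattern is a plain literal rather than a regex
with character classes.

```python
import re
import unicodedata


def normalised_search(pattern, text, form="NFC"):
    """Hypothetical helper: normalise pattern and text to the same
    form before searching, so both use the same representation."""
    pattern = unicodedata.normalize(form, pattern)
    text = unicodedata.normalize(form, text)
    return re.search(pattern, text)


composed = "caf\u00e9"      # é as the single code point U+00E9
decomposed = "cafe\u0301"   # é as U+0065 followed by U+0301

# A naive search for the composed form misses the decomposed text.
naive = re.search("\u00e9", decomposed)

# After normalising both sides, either form finds the match.
match_nfc = normalised_search("\u00e9", decomposed, form="NFC")
match_nfd = normalised_search("\u00e9", decomposed, form="NFD")

# q with an acute accent has no precomposed code point, so NFC
# leaves it as base + combining mark, and searching still works.
q_acute = "q\u0301"
q_unchanged = unicodedata.normalize("NFC", q_acute) == q_acute
q_match = normalised_search(q_acute, "x" + q_acute + "y")

# General category of U+0301 COMBINING ACUTE ACCENT.
acute_category = unicodedata.category("\u0301")

print(naive, match_nfc, match_nfd, q_unchanged, q_match, acute_category)
```

One caveat with this sketch: under NFD a character class such as [é] turns
into [e followed by U+0301], which matches the bare letter or the bare
accent separately, so normalising the pattern is only safe as written for
literal text.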