On Mon, 14 Oct 2019 00:22:36 +0200 Hans Åberg via Unicode <unicode@unicode.org> wrote:
> > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> > <unicode@unicode.org> wrote:
>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match. For example, with PCRE syntax, GNU grep Version 2.25
>> \p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
>> canonical equivalence, I want both to match [:Lu:], and that's what
>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.

> Hopefully some experts here can tune in, explaining exactly what
> regular expressions they have in mind.

The best indication lies at
https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
(2008), which is the last version before support for canonical
equivalence was dropped as a requirement. It's not entirely coherent,
as the authors don't seem to regard an expression like
\p{L}\p{gcb=extend}* as a natural thing to use; the second factor is
mostly sequences of non-starters.

At that point, I would say, they were still expecting \p{Lu} to match
<A, U+0300>, just as they were still expecting [ä] to match both "ä"
and "a\u0308". They hadn't given any thought to
[\p{L}&&\p{isNFD}]\p{gcb=extend}*, and were expecting normalisation
(even to NFC) to be a possible cure. They had begun to realise that
converting expressions to match all or none of a set of canonical
equivalents was hard; the issue of non-contiguous matches wasn't
mentioned.

When I say 'hard', I'm thinking of the problem that concatenation may
require dissolving the two constituent expressions and temporarily
creating 54-fold (if text is handled as NFD) or 2^54-fold (no
normalisation) sets of extra states. That's what drove me to write my
own regular expression engine for traces.

Regards,

Richard.
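
PS: For anyone who wants to see the distinction concretely, here is a
rough Python sketch (standard library only; the function names are
mine, and it is emphatically not the engine mentioned above). It
contrasts a naive per-code-point \p{Lu} test with one that respects
canonical equivalence by composing first:

    import unicodedata

    def naive_lu(s):
        # One code point with General_Category = Lu: roughly the
        # per-code-point test that grep's \p{Lu} applies.
        return len(s) == 1 and unicodedata.category(s) == "Lu"

    def equivalent_lu(s):
        # Sketch of an equivalence-respecting [:Lu:]: accept any
        # sequence whose NFC form is a single Lu code point.
        # (Ignores edge cases such as singleton decompositions.)
        c = unicodedata.normalize("NFC", s)
        return len(c) == 1 and unicodedata.category(c) == "Lu"

    print(naive_lu("\u0100"))        # True:  U+0100 is Lu
    print(naive_lu("A\u0300"))       # False: two code points
    print(equivalent_lu("A\u0300"))  # True:  NFC is U+00C0, which is Lu

    # The NFD side of the same coin: one Lu letter can decompose to
    # several code points, e.g. U+01D5 to three.
    print(len(unicodedata.normalize("NFD", "\u01D5")))  # 3

This only classifies a whole string after the fact, of course; doing
the same thing mid-search, under concatenation, is where the state
blow-up described above comes in.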