> Date: Mon, 14 Oct 2019 01:10:45 +0100
> From: Richard Wordingham via Unicode <unicode@unicode.org>
>
> >> Besides invalidating complexity metrics, the issue was what \p{Lu}
> >> should match. For example, with PCRE syntax, GNU grep Version 2.25
> >> \p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
> >> canonical equivalence, I want both to match [:Lu:], and that's what
> >> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> > Hopefully some experts here can tune in, explaining exactly what
> > regular expressions they have in mind.
>
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.
>
> It's not entirely coherent, as the authors don't seem to find an
> expression like
>
>     \p{L}\p{gcb=extend}*
>
> a natural thing to use, as the second factor is mostly sequences of
> non-starters. At that point, I would say they were still expecting
> \p{Lu} to match <A, U+0300>, as they were still expecting [ä] to
> match both "ä" and "a\u0308".
>
> They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
> were expecting normalisation (even to NFC) to be a possible cure. They
> had begun to realise that converting expressions to match all or none
> of a set of canonical equivalents was hard; the issue of non-contiguous
> matches wasn't mentioned.
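To make the quoted point concrete, here is a minimal sketch using Python's standard unicodedata module, with a per-code-point General_Category check standing in for a regex engine's \p{Lu} (the grep behaviour itself is as quoted above, not reproduced here):

    import unicodedata

    # U+0100 and its NFD form <U+0041, U+0304> are canonically equivalent,
    # but a per-code-point property test sees them very differently.
    precomposed = "\u0100"                                  # A WITH MACRON
    decomposed = unicodedata.normalize("NFD", precomposed)  # "A\u0304"

    print([unicodedata.category(c) for c in precomposed])   # ['Lu']
    print([unicodedata.category(c) for c in decomposed])    # ['Lu', 'Mn']

    # A code-point-level \p{Lu} matches U+0100 as a whole, but in the
    # decomposed string it covers only the base "A", leaving U+0304
    # outside the match; an equivalence-respecting [:Lu:] has to consume
    # the trailing non-starters too, e.g. via \p{Lu}\p{gcb=extend}*.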
I think these are two separate issues. First, whether search should normalize (a.k.a. perform character folding) should be a user option. Second, you are talking only about canonical equivalence, but there is also compatibility decomposition: searching for "1", for example, should perhaps also match "¹" (U+00B9 SUPERSCRIPT ONE) and "①" (U+2460 CIRCLED DIGIT ONE).
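To illustrate the compatibility side, again with Python's standard unicodedata module (NFKC is just one candidate folding here; a real search feature would let the user choose):

    import unicodedata

    # NFKC maps compatibility variants onto their plain forms, so an
    # NFKC-folding search for "1" would also hit these characters.
    for ch in ["1", "\u00b9", "\u2460"]:  # "1", SUPERSCRIPT ONE, CIRCLED DIGIT ONE
        print(ch, "->", unicodedata.normalize("NFKC", ch))

    # Output:
    # 1 -> 1
    # ¹ -> 1
    # ① -> 1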