The problem is that most regex engines are not written to handle some "interesting" features of canonical equivalence, like discontinuity. Suppose that X is canonically equivalent to AB.
- A query /X/ can match the separated A and B in the target string "AcB". So if I have code to do [replace /X/ in "AcB" by "pq"], how should it behave? "pqc", "pcq", "cpq"? If the input was in NFD (for example), should the output be rearranged/decomposed so that it is NFD? And so on.
- A query /A/ can match *part* of the X in the target string "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what should result: "apqBb"?

The syntax and APIs for regex engines are not built to handle these features. Supporting them introduces enough complications in the code, syntax, and semantics that no major implementation has seen fit to do it. We used to have a section in the spec about this, but were convinced that it was better off handled at a higher level.

Mark

On Sun, Oct 13, 2019 at 8:31 PM Asmus Freytag via Unicode <unicode@unicode.org> wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
>
> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode <unicode@unicode.org> wrote:
>
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match. For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?
>
> Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
> should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
> CIRCUMFLEX ACCENT>.
>
> Why does it matter if it is precomposed? Why should it? (For anyone other
> than a character coding maven).
>
> Now, I could invent a string property so
> that \p{xLu} meant (?:\p{Lu}\p{Mn}*).
>
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match". I think only Greek letters expand to 4 characters in NFD.
>
> When I'm respecting canonical equivalence/working with traces, I want
> [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
> CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
> equivalent <U+0E49, U+0E39>. The canonical closure of that
> sequence can be messy even within scripts. Some pairs commute; others
> don't, usually for good reasons.
>
> Some models may be more natural for different scripts. Certainly, in SEA
> or Indic scripts, most combining marks are not best modeled with properties
> as "inherited". But for L/G/C etc. it would be a different matter.
>
> For general recommendations, such as UTS#18, it would be good to move the
> state of the art so that the "primitives" are in line with the way typical
> writing systems behave, so that people can write "linguistically correct"
> regexes.
>
> A./
>
> Regards,
>
> Richard.
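
To make the cases in this thread concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The concrete characters are my own choice for illustration: X is U+00C5 (canonically equivalent to A + U+030A), and the intervening mark is U+0323. The helper names (`canonically_equivalent`, `is_single_lu`) are hypothetical, and `is_single_lu` is only one approximate reading of "[:Lu:] under canonical equivalence", not what any regex engine implements.

```python
import unicodedata


def nfd(s: str) -> str:
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms are equal."""
    return nfd(a) == nfd(b)


def is_single_lu(seq: str) -> bool:
    """Approximation (an assumption of this sketch): treat seq as matching
    [:Lu:] only if it composes under NFC to ONE code point of category Lu."""
    composed = unicodedata.normalize("NFC", seq)
    return len(composed) == 1 and unicodedata.category(composed) == "Lu"


# Discontinuity: X = U+00C5, which decomposes to "A" + U+030A (ring above).
X = "\u00C5"
# Target in NFD: "A", U+0323 (dot below, ccc 220), U+030A (ring above, ccc 230).
target = "A\u0323\u030A"

# The target is canonically equivalent to X followed by the dot below ...
print(canonically_equivalent(target, X + "\u0323"))   # True
# ... yet a plain substring search finds neither X nor its NFD form,
# because the two pieces of X are separated by U+0323.
print(X in target, nfd(X) in target)                  # False False

# Partial match: "A" is not a code point of "aXb", but it is one after NFD.
print("A" in "a" + X + "b", "A" in nfd("a" + X + "b"))  # False True

# Thai: <U+0E39, U+0E49> and <U+0E49, U+0E39> are canonically equivalent.
print(canonically_equivalent("\u0E39\u0E49", "\u0E49\u0E39"))  # True

# [:Lu:] cases: U+0100, <A, U+0300>, and <M, U+0302>.
print(is_single_lu("\u0100"), is_single_lu("A\u0300"), is_single_lu("M\u0302"))
# True True False
```

Normalizing both pattern and text to NFD before matching helps with the partial-match case but not with discontinuity, and it says nothing about what a replacement should produce, which is the open question above.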