On Fri, 11 Oct 2019 14:35:33 -0700 Markus Scherer via Unicode <unicode@unicode.org> wrote:
> > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters
> > > in the alternation -- so this works equivalently if longer
> > > strings are sorted first.

> > Does conformance UTS#18 to level 2 mandate the choice of matching
> > substring? This would appear to prohibit compliance to POSIX rules,
> > where the length of overall match counts.

> The idea is currently to specify properties-of-strings (and I think a
> range/class with "clusters") behaving like an alternation where the
> longest strings are first, and leaving it up to the regex engine
> exactly what that means.
>
> In general, UTS #18 offers a lot of things that regex implementers
> may or may not adopt.

> If you have specific ideas, please send them as PRI feedback.
> (Discussion on the list is good and useful, but does not guarantee
> that it gets looked at when it counts.)

You claimed the order of alternatives mattered. That is an important
issue for anyone rash enough to think that the standard is fit to be
used as a specification.

I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/
can mean. If the system uses NFD to simulate Unicode conformance, shall
the expression then be converted to /[{A\u0301}{a\u0301}]/? Or should
it simply fail to match any NFD string?

I've been implementing the view that all or none of the canonical
equivalents of a string match. (I therefore support mildly
discontiguous substrings, though I don't support splitting
undecomposable characters.)

Richard.
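
For illustration only, the two points above (the order of alternatives, and a precomposed class applied to NFD text) can be sketched in Python. The assumptions are that Python's re module stands in for a leftmost-first backtracking engine, and that a plain alternation stands in for the {A\u0301} string-in-class notation, which re does not support. Python's re does not implement \q{...} or UTS #18 level 2, and the helper name matched and the pattern names are invented for this example.

```python
import re
import unicodedata

# Alternation order in a leftmost-first engine: the first alternative
# that lets the overall pattern succeed wins, not the longest match.
longest_first = re.compile(r"(?:ch|c)h")   # models [c \q{ch}]h, longer string first
shortest_first = re.compile(r"(?:c|ch)h")  # same alternatives, shorter string first


def matched(pattern, text):
    """Return the substring matched at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


# A class of precomposed characters, and a hand-written decomposed
# rewrite of it as an alternation of base letter + U+0301.
composed_class = re.compile("[\u00c1\u00e1]")
decomposed_alternation = re.compile("(?:A\u0301|a\u0301)")

nfc_text = "\u00c1"                                # A WITH ACUTE, precomposed
nfd_text = unicodedata.normalize("NFD", nfc_text)  # "A" followed by U+0301

if __name__ == "__main__":
    print(matched(longest_first, "chh"))   # chh
    print(matched(shortest_first, "chh"))  # ch
    print(composed_class.search(nfd_text))               # None
    print(decomposed_alternation.search(nfd_text) is not None)  # True
```

On the input "chh" the two orderings give different matches in this engine, whereas a POSIX leftmost-longest engine would be expected to report "chh" for both. Likewise the precomposed class finds nothing in the NFD string unless the pattern is rewritten, and the rewritten pattern in turn finds nothing in the NFC string, which is the all-or-none question raised above.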