> On 11 Oct 2019, at 00:23, Markus Scherer via Unicode <unicode@unicode.org> wrote:
>
> On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <unicode@unicode.org> wrote:
> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c". Thus the strings "ca" and "cha" would both match the
> expression
>
> [c \q{ch}]a
>
> while "chh" but not "ch" would match against
>
> [c \q{ch}]h
>
> Right. We just independently discussed this today in the UTC meeting,
> connected with the "properties of strings" discussion in the proposed update.
>
> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in the
> alternation -- so this works equivalently if longer strings are sorted first.
>
> May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
>
> ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].
>
> ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more
> backward-compatible.
Not quite following this discussion, but I got triggered by the use of Perl in this discussion.

In Perl 6 (which is a different language from Perl 5 altogether), regular expressions have been completely revamped.

In Perl 6, the use of "|" indicates alternatives using longest token matching (LTM):

  https://docs.perl6.org/language/regexes#index-entry-regex_|-Longest_alternation:_|

In Perl 6, the use of "||" indicates first matching alternative wins:

  https://docs.perl6.org/language/regexes#index-entry-regex_||-Alternation:_||

Furthermore, Perl 6 uses Normalization Form Grapheme for matching:

  https://docs.perl6.org/type/Cool#index-entry-Grapheme

Hope this has some relevance to this discussion / gives new viewpoints.

Elizabeth Mattijsen
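
For readers who want to see the two alternation semantics side by side, here is a minimal sketch in Python. It is only an approximation for literal alternatives such as "c" and "ch": Python's `re` module uses ordered (first match wins) alternation, and the longest-match behaviour is emulated here by sorting the alternatives by length, as suggested in the quoted message. The helper names `first_match` and `longest_match` are made up for this illustration, and the sketch does not attempt to reproduce Perl 6's full LTM rules or the UTS #18 `\q{...}` syntax.

```python
import re


def first_match(alternatives, text):
    """Ordered alternation: the first listed alternative that matches wins."""
    pattern = "|".join(re.escape(a) for a in alternatives)
    m = re.match(pattern, text)
    return m.group(0) if m else None


def longest_match(alternatives, text):
    """Emulated longest alternation: try longer literals before shorter ones."""
    ordered = sorted(alternatives, key=len, reverse=True)
    return first_match(ordered, text)


print(first_match(["c", "ch"], "cha"))    # c
print(longest_match(["c", "ch"], "cha"))  # ch

# Python's engine backtracks into the alternation, so (ch|c)h still matches
# the whole string "ch" by falling back to "c". An engine that commits to the
# longest alternative without backtracking would reject "ch" here.
print(bool(re.fullmatch(r"(?:ch|c)h", "ch")))   # True
print(bool(re.fullmatch(r"(?:ch|c)h", "chh")))  # True
```

The last two lines suggest that whether "ch" matches an expression like `[c \q{ch}]h` depends on whether the engine may backtrack from "ch" to "c", not only on the order of the alternatives.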