Re: Pure Regular Expression Engines and Literal Clusters

Elizabeth Mattijsen via Unicode Fri, 11 Oct 2019 03:44:58 -0700

> On 11 Oct 2019, at 00:23, Markus Scherer via Unicode <[email protected]> 
> wrote:
> 
> On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode 
> <[email protected]> wrote:
> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
> 
> [c \q{ch}]
> 
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c"'.  Thus the strings "ca" and "cha" would both match the
> expression
> 
> [c \q{ch}]a
> 
> while "chh" but not "ch" would match against
> 
> [c \q{ch}]h
> 
> Right. We just independently discussed this today in the UTC meeting, 
> connected with the "properties of strings" discussion in the proposed update.
> 
> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in the 
> alternation -- so this works equivalently if longer strings are sorted first.
> 
> May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
> 
> ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].
> 
> ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more 
> backward-compatible.


I have not been following this discussion closely, but the mention of Perl 
caught my attention.

In Perl 6 (which is a different language from Perl 5 altogether), regular 
expressions have been completely revamped.

In Perl 6, the use of "|" indicates alternatives using longest token matching 
(LTM):
    https://docs.perl6.org/language/regexes#index-entry-regex_|-Longest_alternation:_|

In Perl 6, the use of "||" indicates first matching alternative wins:
    https://docs.perl6.org/language/regexes#index-entry-regex_||-Alternation:_||
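
A rough illustration of the difference between the two kinds of alternation,
sketched in Python rather than Perl 6. Python's "|" behaves like the
first-match "||" described above, so the longest-wins behaviour is only
emulated here by sorting literal alternatives longest first. The helper names
first_match and longest_match are made up for this example, and the sketch
covers plain literals only, not the general LTM rules of Perl 6.

```python
import re

def first_match(alternatives, text):
    """Ordered alternation: the first listed alternative that matches wins."""
    pattern = "|".join(re.escape(a) for a in alternatives)
    m = re.match(pattern, text)
    return m.group(0) if m else None

def longest_match(alternatives, text):
    """Emulate longest-alternative-wins for literals by trying longer ones first."""
    ordered = sorted(alternatives, key=len, reverse=True)
    return first_match(ordered, text)

print(first_match(["c", "ch"], "cha"))    # c
print(first_match(["ch", "c"], "cha"))    # ch
print(longest_match(["c", "ch"], "cha"))  # ch
```

With first-match semantics the result depends on the order in which the
alternatives are written. With the longest-first emulation it does not.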

Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
    https://docs.perl6.org/type/Cool#index-entry-Grapheme
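
To show why the unit of matching matters, here is a small Python sketch. It
is not NFG: Python's re module matches code points, and NFC composition is
used below only as a stand-in that happens to work when a precomposed
character exists. The helper name dot_matches_whole is invented for this
example.

```python
import re
import unicodedata

def dot_matches_whole(text):
    """True if a single '.' consumes the entire string."""
    return re.fullmatch(".", text) is not None

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT: two code points
composed = unicodedata.normalize("NFC", decomposed)  # U+00E9: one code point

print(len(decomposed), dot_matches_whole(decomposed))  # 2 False
print(len(composed), dot_matches_whole(composed))      # 1 True
```

A cluster with no precomposed form stays several code points long under NFC,
so this stand-in breaks down exactly where grapheme-level matching is most
useful.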

Hope this has some relevance to this discussion / gives new viewpoints.



Elizabeth Mattijsen
