Re: Pure Regular Expression Engines and Literal Clusters

Richard Wordingham via Unicode Sat, 12 Oct 2019 15:08:03 -0700

On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode <[email protected]> wrote:



> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
>     https://docs.perl6.org/type/Cool#index-entry-Grapheme

This approach does address the issue Mark Davis mentioned about regex
engines working at the wrong level.  Perhaps you can put my mind at
rest about whether it works at all with scripts that subordinate
vowels.

If I wanted to find the occurrences of the Pali word _pacati_ 'to cook'
in Latin script text using form NFG, I could use a Perl regular
expression like /\b(?:a|pa)?p[aā]c(?:\B.)*/.  (At least,

grep -P '\b(?:a|pa)?p[aā]c\p{Ll}*' file.txt

works on text in NFC.  I couldn't work out the command-line expression
to display a list of matches from Perl, and the PCRE \B is broken beyond
ASCII in GNU grep 2.25.)
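
For readers who want a list of matches rather than matching lines, here is
a minimal sketch in Python rather than Perl.  It assumes the groups are
meant to be non-capturing, (?:...), and that normalising the input to NFC
first is acceptable.  Python's re module has no \p{Ll}, so the \B. form is
used.  The name find_forms and the sample words are made up for
illustration.

```python
import re
import unicodedata

# The pattern from the message, with non-capturing groups.
# \B. consumes characters until the next word boundary is reached.
PATTERN = re.compile(r"\b(?:a|pa)?p[a\u0101]c(?:\B.)*")


def find_forms(text):
    """Return every match in text, after normalising it to NFC."""
    nfc = unicodedata.normalize("NFC", text)
    return [m.group(0) for m in PATTERN.finditer(nfc)]


if __name__ == "__main__":
    # Illustrative input only, not taken from a real text.
    print(find_forms("pacati apaci papaca ca"))
```

The NFC step matters because a decomposed ā (a followed by U+0304) would
put a combining mark, which Python does not treat as a word character,
between the 'a' and the 'c'.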

How would I do such a search in an Indic script using form NFG?

The main issue is that the single character 'c' would have to expand to
a list of all but one of the Pali grapheme clusters whose initial
consonant transliterates to 'c'.  Have you a notation for such a class?
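
To make the size of that expansion concrete, here is a rough sketch in
Python that spells out candidate clusters by hand.  It rests on several
assumptions: Devanagari as the script, a simplified inventory of dependent
signs, and no niggahita or conjuncts.  It does not try to decide which
cluster would be the excluded one, and the names clusters_for_c and
alternation are hypothetical.

```python
import re

# Assumption: Devanagari, one of several scripts used for Pali.
CA = "\u091A"  # DEVANAGARI LETTER CA, transliterated 'c'

# Assumed, simplified inventory of signs that can follow CA in a cluster.
SIGNS = {
    "aa": "\u093E",
    "i": "\u093F",
    "ii": "\u0940",
    "u": "\u0941",
    "uu": "\u0942",
    "e": "\u0947",
    "o": "\u094B",
    "virama": "\u094D",
}


def clusters_for_c():
    """List the bare consonant (inherent 'a') and each consonant+sign pair."""
    return [CA] + [CA + sign for sign in SIGNS.values()]


def alternation(clusters):
    """Join clusters into one regex alternation, longest alternatives first."""
    ordered = sorted(clusters, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(c) for c in ordered) + ")"


if __name__ == "__main__":
    print(alternation(clusters_for_c()))
```

Even under these simplifications, one Latin letter becomes a nine-way
alternation, which is why a compact notation for such a class would help.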

Regards,

Richard.
