On Fri, 11 Oct 2019 12:39:56 +0200 Elizabeth Mattijsen via Unicode <unicode@unicode.org> wrote:
> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
> https://docs.perl6.org/type/Cool#index-entry-Grapheme

This approach does address the issue Mark Davis mentioned about regex engines working at the wrong level. Perhaps you can put my mind at rest about whether it works at all with scripts that subordinate vowels.

If I wanted to find the occurrences of the Pali word _pacati_ 'to cook' in Latin script text using form NFG, I could use a Perl regular expression like /\b(?:a|pa)?p[aā]c(?:\B.)*/. (At least, grep -P '\b(?:a|pa)?p[aā]c\p{Ll}*' file.txt works on text in NFC. I couldn't work out the command-line expression to display a list of matches from Perl, and the PCRE \B is broken beyond ASCII in GNU grep 2.25.)

How would I do such a search in an Indic script using form NFG? The main issue is that the single character 'c' would have to expand to a list of all but one of the Pali grapheme clusters whose initial consonant transliterates to 'c'. Have you a notation for such a class?

Regards,

Richard.
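
For illustration only, the Latin-script search can be sketched in Python, whose re module treats \b and \B as Unicode-aware for str patterns. This is neither Perl 6 nor PCRE, and it says nothing about how NFG matching behaves. The name find_pac_forms and the sample sentence are made up for the example, and the input is assumed to be normalizable to NFC so that ā is a single code point.

```python
import re
import unicodedata

# Same shape as the expression in the message, with the non-capturing
# groups written as (?:...). \u0101 is precomposed a-macron, so the
# text is normalized to NFC before matching.
PAC_FORMS = re.compile(r"\b(?:a|pa)?p[a\u0101]c(?:\B.)*")


def find_pac_forms(text):
    """Return every match of the pattern in text, after NFC normalization."""
    return PAC_FORMS.findall(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from any real Pali text.
    sample = "so odana\u1e43 pacati, apaci, pap\u0101ca, p\u0101cesi, na vacati"
    for match in find_pac_forms(sample):
        print(match)
```

On the sample this lists pacati, apaci, papāca and pācesi, and skips vacati. It works only because each Latin letter is its own code point; it does not address the Indic-script case, where 'c' would have to stand for a whole class of grapheme clusters.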