Re: Pure Regular Expression Engines and Literal Clusters

Asmus Freytag via Unicode Sun, 13 Oct 2019 20:30:08 -0700

On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:

On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode wrote:

On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
Besides invalidating complexity metrics, the issue was what \p{Lu}
should match.  For example, with PCRE syntax in GNU grep version 2.25,
\p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.
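
The difference between the two behaviours can be sketched in Python with the standard unicodedata module. This is only an illustration: the function names are hypothetical, and composing to NFC is an assumed stand-in for "canonically equivalent to a single Lu character", not the implementation described in the message.

```python
import unicodedata

def is_lu_code_point(s: str) -> bool:
    # Code-point-level test: exactly one code point with General_Category Lu.
    return len(s) == 1 and unicodedata.category(s) == "Lu"

def is_lu_up_to_canonical_equivalence(s: str) -> bool:
    # Assumption: compose to NFC first, then apply the code-point-level test.
    return is_lu_code_point(unicodedata.normalize("NFC", s))

for label, text in [("U+0100", "\u0100"), ("<A, U+0300>", "A\u0300")]:
    print(label,
          is_lu_code_point(text),
          is_lu_up_to_canonical_equivalence(text))
```

The code-point-level test accepts U+0100 and rejects <A, U+0300>, while the normalizing test accepts both.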

Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
instead of formally handling NFD, you could extend the syntax to
handle "inherited" properties across combining sequences.

Am I missing anything?

Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
CIRCUMFLEX ACCENT>.

Why does it matter if it is precomposed? Why should it? (For anyone other than a character coding maven).

 Now, I could invent a string property \p{xLu} that meant
(?:\p{Lu}\p{Mn}*).
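
A minimal sketch of how such a string property would differ from matching under canonical equivalence, using Python's unicodedata module. Both function names are hypothetical, and the NFC-based check is an assumed approximation of "canonically equivalent to one Lu character".

```python
import unicodedata

def xlu_prefix_length(s: str) -> int:
    # Length of the prefix of s matching (?:\p{Lu}\p{Mn}*), or 0 if there is none.
    if not s or unicodedata.category(s[0]) != "Lu":
        return 0
    n = 1
    while n < len(s) and unicodedata.category(s[n]) == "Mn":
        n += 1
    return n

def composes_to_single_lu(s: str) -> bool:
    # Assumption: "equivalent to one Lu character" is tested by composing to NFC.
    c = unicodedata.normalize("NFC", s)
    return len(c) == 1 and unicodedata.category(c) == "Lu"

m_circumflex = "M\u0302"   # no precomposed form exists
a_grave = "A\u0300"        # composes to U+00C0

print(xlu_prefix_length(m_circumflex), composes_to_single_lu(m_circumflex))
print(xlu_prefix_length(a_grave), composes_to_single_lu(a_grave))
```

The \p{xLu}-style matcher consumes both sequences whole, whereas the canonical-equivalence check accepts only <A, U+0300>.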

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match".  I think only Greek letters expand to 4 characters in NFD.
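
The claim about four-character expansions can be checked against whatever version of the Unicode Character Database a given Python build ships. The sketch below is one way to list the candidates; the helper name is hypothetical and the result depends on the Unicode version in use.

```python
import unicodedata

def chars_with_nfd_length(n, category=None):
    # Collect code points whose canonical decomposition (NFD) has exactly n
    # code points, optionally restricted to one General_Category value.
    result = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if category is not None and unicodedata.category(ch) != category:
            continue
        if len(unicodedata.normalize("NFD", ch)) == n:
            result.append(ch)
    return result

for ch in chars_with_nfd_length(4):
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),
          unicodedata.name(ch, "<unnamed>"))
```

Passing category="Lu" restricts the listing to the characters that a code-point-level [:Lu:] would match.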

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
equivalent <U+0E49, U+0E39>.  The canonical closure of that
sequence can be messy even within scripts.  Some pairs commute: others
don't, usually for good reasons.
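
Whether two marks commute can be sketched by comparing NFD forms, since canonical reordering depends on the canonical combining class of each mark. The helper names below are hypothetical; only the standard unicodedata module is assumed.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def marks_commute(m1: str, m2: str) -> bool:
    # The pair commutes if either order is canonically equivalent to the other.
    return canonically_equivalent(m1 + m2, m2 + m1)

SARA_UU = "\u0E39"   # THAI CHARACTER SARA UU
MAI_EK = "\u0E48"    # THAI CHARACTER MAI EK
MAI_THO = "\u0E49"   # THAI CHARACTER MAI THO

print(unicodedata.combining(SARA_UU))    # 103
print(unicodedata.combining(MAI_THO))    # 107
print(marks_commute(SARA_UU, MAI_THO))   # True: different non-zero classes
print(marks_commute(MAI_EK, MAI_THO))    # False: both have class 107
```

A matcher that respects canonical equivalence has to accept every ordering for which marks_commute holds, which is where the canonical closure of a sequence grows.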

Some models may be more natural for different scripts. Certainly, in SEA or Indic scripts, most combining marks are not best modeled with properties as "inherited". But for L/G/C etc. it would be a different matter.

For general recommendations, such as UTS#18, it would be good to move the state of the art so that the "primitives" are in line with the way typical writing systems behave, so that people can write "linguistically correct" regexes.

A./

Regards,

Richard.
