On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode <unicode@unicode.org> wrote:
>
>> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match. For example, with PCRE syntax in GNU grep version 2.25,
>>> \p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>>
>> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*?
>> Instead of formally handling NFD, you could extend the syntax to
>> handle "inherited" properties across combining sequences. Am I
>> missing anything?
>
> Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so
> [:Lu:] should not match <U+004D LATIN CAPITAL LETTER M, U+0302
> COMBINING CIRCUMFLEX ACCENT>.

Why does it matter if it is precomposed? Why should it? (For anyone
other than a character coding maven.)

> Now, I could invent a string property so that \p{xLu} meant
> (?:\p{Lu}\p{Mn}*).
>
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match". I think only Greek letters expand to 4 characters in NFD.
>
> When I'm respecting canonical equivalence/working with traces, I want
> [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
> CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
> equivalent <U+0E49, U+0E39>. The canonical closure of that sequence
> can be messy even within scripts. Some pairs commute; others don't,
> usually for good reasons.

Some models may be more natural for different scripts. Certainly, in
SEA or Indic scripts, most combining marks are not best modeled with
properties as "inherited". But for L/G/C etc. it would be a different
matter.

For general recommendations, such as UTS#18, it would be good to move
the state of the art so that the "primitives" are in line with the way
typical writing systems behave, so that people can write
"linguistically correct" regexes.

A./

> Regards, Richard.
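
To make the two readings of [:Lu:] in this exchange concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It is an illustration, not either poster's implementation. All function names are hypothetical. The NFC shortcut in `matches_lu_canonically` is an approximation that ignores edge cases such as composition exclusions. The small `INSC` table is a hand-copied assumption standing in for the Indic_Syllabic_Category data, which Python does not expose. The permutation search only covers reorderings of a short sequence, not the full canonical closure.

```python
import unicodedata
from itertools import permutations


def matches_lu_canonically(seq: str) -> bool:
    # Reading 1: seq must be canonically equivalent to a single Lu code point.
    # Approximated by composing with NFC and checking for one Lu character.
    nfc = unicodedata.normalize("NFC", seq)
    return len(nfc) == 1 and unicodedata.category(nfc) == "Lu"


def matches_lu_inherited(seq: str) -> bool:
    # Reading 2: the pattern \p{Lu}\p{Mn}*, i.e. an Lu base character
    # followed by any number of nonspacing marks.
    if not seq or unicodedata.category(seq[0]) != "Lu":
        return False
    return all(unicodedata.category(c) == "Mn" for c in seq[1:])


# Hypothetical two-entry excerpt of Indic_Syllabic_Category (assumption).
INSC = {
    "\u0E39": "vowel_dependent",  # THAI CHARACTER SARA UU
    "\u0E49": "tone_mark",        # THAI CHARACTER MAI THO
}


def canonical_variants(s: str) -> set:
    # All reorderings of NFD(s) that normalize back to the same NFD string.
    # Factorial cost, so only suitable for very short sequences.
    target = unicodedata.normalize("NFD", s)
    return {
        "".join(p)
        for p in permutations(target)
        if unicodedata.normalize("NFD", "".join(p)) == target
    }


def matches_insc(s: str, props: list) -> bool:
    # True if some canonically equivalent reordering of s matches the
    # property sequence one code point per property.
    return any(
        len(v) == len(props)
        and all(INSC.get(c) == p for c, p in zip(v, props))
        for v in canonical_variants(s)
    )
```

Under these assumptions, the two readings agree on U+0100 and <A, U+0300> and differ only on sequences such as <M, U+0302>, which has no precomposed form. The Thai pair is accepted in either order because the two marks have different canonical combining classes and therefore commute.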