On Mon, 29 Jan 2018 07:16:04 +0100 Philippe Verdy via Unicode <[email protected]> wrote:
> 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode <[email protected]>:
>
> > In the search you have in mind, the converted regex for use with NFD
> > strings is actually intelligible and simple:
> >
> > <LATIN SMALL LETTER A>
> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > <COMBINING DOT BELOW>
> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > <COMBINING CIRCUMFLEX>
> >
> > Informal notation can simplify the regex still further.
> >
> > There is no upper bound to the length of a string matching that
> > regex,
>
> Wrong, you've not read what followed immediately that commented it
> already: it IS bound exactly because you cannot duplicate the same
> combining class, and there's a known finite number of them for
> acceptable cases: if there's any repetition, it will always be within
> that bound.

Are you talking about regular expressions or strings that match them?

Natural language text can very easily contain adjacent combining
characters of the same combining class - look no further than the full
decomposition of U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON.
For a few combining characters, such as U+1A7F TAI THAM COMBINING
CRYPTOGRAMMIC DOT, repetition is of their very essence. One can find
pairs of combining circumflexes in plain text maths.

Incidentally, I was talking about regular expressions, which imply
*finite* state machines, albeit huge, rather than 'regexes', which are
similar but may formally require unbounded memory.

Richard.
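
For anyone who wants to check the two claims above, here is a rough
Python sketch using only the standard unicodedata module. The function
names (same_class_pairs, is_filler, matches) are made up for
illustration, and matches() is a hand-rolled, whole-string reading of
the quoted pattern rather than a real regex engine; treating the
pattern as anchored at both ends is an assumption of the sketch. U+0334
COMBINING TILDE OVERLAY is used as filler because its combining class
is neither 0, above nor below.

```python
import unicodedata

DOT_BELOW = "\u0323"      # COMBINING DOT BELOW, combining class 220 (below)
CIRCUMFLEX = "\u0302"     # COMBINING CIRCUMFLEX ACCENT, combining class 230 (above)
TILDE_OVERLAY = "\u0334"  # COMBINING TILDE OVERLAY, combining class 1


def same_class_pairs(text):
    """Adjacent marks in NFD(text) that share a nonzero combining class."""
    nfd = unicodedata.normalize("NFD", text)
    pairs = []
    for first, second in zip(nfd, nfd[1:]):
        cc = unicodedata.combining(first)
        if cc != 0 and cc == unicodedata.combining(second):
            pairs.append((first, second, cc))
    return pairs


def is_filler(ch):
    """True for [[^[:cc=0:]] - [[:cc=above:][:cc=below:]]] in the quoted pattern."""
    return unicodedata.combining(ch) not in (0, 220, 230)


def matches(s):
    """Whole-string match of: a filler* DOT_BELOW filler* CIRCUMFLEX."""
    if not s.startswith("a"):
        return False
    i = 1
    while i < len(s) and is_filler(s[i]):
        i += 1
    if i >= len(s) or s[i] != DOT_BELOW:
        return False
    i += 1
    while i < len(s) and is_filler(s[i]):
        i += 1
    return s[i:] == CIRCUMFLEX


def padded(n):
    """An NFD string matching the pattern, with n filler marks inserted."""
    return "a" + TILDE_OVERLAY * n + DOT_BELOW + CIRCUMFLEX


if __name__ == "__main__":
    # U+01D6 decomposes to u + diaeresis + macron; both marks are class 230.
    print(same_class_pairs("\u01D6"))
    # The matching string can be made as long as one likes.
    for n in (0, 5, 1000):
        s = padded(n)
        print(n, len(s), matches(s), unicodedata.normalize("NFD", s) == s)
```

The padded strings are already in NFD (their combining classes run 1,
..., 1, 220, 230), so the length of a match grows with n while the
pattern itself stays the same finite expression.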

