Note that for finding occurence of simpler combining sequences such as finding <LATIN SMALL LETTER A WITH CIRCUMFLEX> the regexp is simpler: <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <COMBINING CIRCUMFLEX>
The central character class allows 53 distinct combining classes, and the maximum match length is 2+53=55 characters. If Unicode assigns new combining classes for new combining characters, the maximum match length will increase by 1 character for this regexp, and by 2 characters in the previous example. As there can be at most 255 non-zero combining classes (due to current stability rules), finding <LATIN SMALL LETTER A WITH CIRCUMFLEX> will match at most 1+253+1 = 255 characters in any future version of Unicode, and finding <LATIN SMALL LETTER A WITH CIRCUMFLEX,DOT BELOW> will match at most 1+252+1+252+1 = 507 characters. This is still finite, small enough to be implementable with a deterministic FSM, using no more than 1 codepoint of lookahead, without using any backtrailing. 2018-01-28 20:30 GMT+01:00 Philippe Verdy <[email protected]>: > Typo, the full regexp has undesired asterisks: > > <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > ( <COMBINING CIRCUMFLEX> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] > ]] * <COMBINING DOT BELOW> > | <COMBINING DOT BELOW> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] > ]] * < COMBINING CIRCUMFLEX> > > > > 2018-01-28 20:29 GMT+01:00 Philippe Verdy <[email protected]>: > >> >> >> 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < >> [email protected]>: >> >>> On Sat, 27 Jan 2018 14:13:40 -0800The theory >>> of regular expressions (though you may not think that mathematical >>> regular expressions matter) extends to trace monoids, with the >>> disturbing exception that the Kleene star of a regular language is not >>> necessarily regular. (The prototypical example is sequences (xy)^n >>> where x and y are distinct and commute, i.e. xy and yx are canonically >>> equivalent in Unicode terms. A Unicode example is the set of strings >>> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that >>> will recognise canonically equivalent strings). >>> >> >> I don't see why you can't write this as the regular expression: >> (x | y)* >> For the Unicode canonical equivalences, this applies to distinct >> characters that have distinct non-zero combining classes. >> >> But of course searching for >> >> <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or >> <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX > >> >> requires transforming it to NFD first as: >> >> <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX> >> >> so thet the regexp transforms to: >> >> <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] >> * >> ( <COMBINING CIRCUMFLEX> * [[ [^[[:cc=0:]]] - >> [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT BELOW> >> | <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] - >> [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> >> >> Note that the "complex" set of characters used three times above is >> finite, it contains all combining characters of Unicode that have a >> non-zero combining class except above and below, i.e. all Unicode >> characters whose combining class is not 0, 220 (below) or 230 (above). >> >> However, It is too simplified, because the allowed combining classes must >> occur at most once in each possible non-zero combining class and not >> arbitrary numbers of them: these allowed combining classs currently are in >> {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218, >> 222, 224, 226, 228, 232..234, 240} whose most member elements are used for >> very few combining characters (the above and below combining classes are >> the most populated ones but we exclude them, all the others have 1 to 9 >> combining characters assigned to them, or 22 characters with cc=7 (nukta), >> or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama). >> >> Once again we can refine them also as a regexp, but this is combinatorial >> because we have 52 combining classes (so we would need to consider the 52! >> (factorial) alternates). But given the maximum length of what this can >> match (0 to 52 combining characters: yes it is finite), this is best done >> by not rewriting this as a regexp but by replacing the final "*" by {1,52}, >> and then just check each returned match of >> >> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52} >> >> with a simple scan of these short strings to see that they all have >> distinct combining classes (this just requires 52 booleans, easily stored >> in a single 64 bit integer initialized to 0 prior to scan the scan of these >> small strings). But the theory does not prevent writing it as a regexp >> (even if it would be extremely long). So a Kleene Star closure is >> possible and can be efficiently implemented (all depends on the way you >> represent the "current state" in the FSM: a single integer representing a >> single node number in the traversal graph is not the best way to do that. >> >> This is a valid regexp, the finite state machine DOES have a finite >> lookahead (the full regexp above will match AT MOST 107 characters >> including the combining marks, where 107=3+2*52), but this is general to >> regexps that generally cannot be transformed directly into a FSM with >> finite lookahead, but a FSM is possible: the regexp first transforms into a >> simple graph of transitions with a finite number of node (this number is >> bound to the length of the regexp itself) where there can be multiple >> states active simultaneously; then a basic transform converts this small >> graph by transforming nodes into new nodes representing the finite set of >> the combinations of active states in the first graph : >> >> There will be many more nodes, and generally this explodes in size >> because the transform is combinatorial, and such size explosion has worst >> perfomance (explosion of the memory needed to representing the new graph >> with a single state active). So regexp engines use the alternative by >> representing the current state of traversal of the first simple graph using >> a stack of active states and transiting them all separately (this can be >> implemented with a "bitset" whose size in bits is the number of states in >> the first simple graph, or by using an associative array (dictionnary of >> boolean properties whose keys are state numbers in the first graph, which >> can be set or removed: this requires much less memory and it is relatively >> fast, even if the full current state is not just a single small integer. >> >> >

