Typo, the full regexp has undesired asterisks: <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * ( <COMBINING CIRCUMFLEX> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT BELOW> | <COMBINING DOT BELOW> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX>
2018-01-28 20:29 GMT+01:00 Philippe Verdy <verd...@wanadoo.fr>: > > > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < > unicode@unicode.org>: > >> On Sat, 27 Jan 2018 14:13:40 -0800The theory >> of regular expressions (though you may not think that mathematical >> regular expressions matter) extends to trace monoids, with the >> disturbing exception that the Kleene star of a regular language is not >> necessarily regular. (The prototypical example is sequences (xy)^n >> where x and y are distinct and commute, i.e. xy and yx are canonically >> equivalent in Unicode terms. A Unicode example is the set of strings >> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that >> will recognise canonically equivalent strings). >> > > I don't see why you can't write this as the regular expression: > (x | y)* > For the Unicode canonical equivalences, this applies to distinct > characters that have distinct non-zero combining classes. > > But of course searching for > > <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or > <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX > > > requires transforming it to NFD first as: > > <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX> > > so thet the regexp transforms to: > > <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > ( <COMBINING CIRCUMFLEX> * [[ [^[[:cc=0:]]] - > [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT BELOW> > | <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] - > [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> > > Note that the "complex" set of characters used three times above is > finite, it contains all combining characters of Unicode that have a > non-zero combining class except above and below, i.e. all Unicode > characters whose combining class is not 0, 220 (below) or 230 (above). > > However, It is too simplified, because the allowed combining classes must > occur at most once in each possible non-zero combining class and not > arbitrary numbers of them: these allowed combining classs currently are in > {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218, > 222, 224, 226, 228, 232..234, 240} whose most member elements are used for > very few combining characters (the above and below combining classes are > the most populated ones but we exclude them, all the others have 1 to 9 > combining characters assigned to them, or 22 characters with cc=7 (nukta), > or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama). > > Once again we can refine them also as a regexp, but this is combinatorial > because we have 52 combining classes (so we would need to consider the 52! > (factorial) alternates). But given the maximum length of what this can > match (0 to 52 combining characters: yes it is finite), this is best done > by not rewriting this as a regexp but by replacing the final "*" by {1,52}, > and then just check each returned match of > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52} > > with a simple scan of these short strings to see that they all have > distinct combining classes (this just requires 52 booleans, easily stored > in a single 64 bit integer initialized to 0 prior to scan the scan of these > small strings). But the theory does not prevent writing it as a regexp > (even if it would be extremely long). So a Kleene Star closure is > possible and can be efficiently implemented (all depends on the way you > represent the "current state" in the FSM: a single integer representing a > single node number in the traversal graph is not the best way to do that. > > This is a valid regexp, the finite state machine DOES have a finite > lookahead (the full regexp above will match AT MOST 107 characters > including the combining marks, where 107=3+2*52), but this is general to > regexps that generally cannot be transformed directly into a FSM with > finite lookahead, but a FSM is possible: the regexp first transforms into a > simple graph of transitions with a finite number of node (this number is > bound to the length of the regexp itself) where there can be multiple > states active simultaneously; then a basic transform converts this small > graph by transforming nodes into new nodes representing the finite set of > the combinations of active states in the first graph : > > There will be many more nodes, and generally this explodes in size because > the transform is combinatorial, and such size explosion has worst > perfomance (explosion of the memory needed to representing the new graph > with a single state active). So regexp engines use the alternative by > representing the current state of traversal of the first simple graph using > a stack of active states and transiting them all separately (this can be > implemented with a "bitset" whose size in bits is the number of states in > the first simple graph, or by using an associative array (dictionnary of > boolean properties whose keys are state numbers in the first graph, which > can be set or removed: this requires much less memory and it is relatively > fast, even if the full current state is not just a single small integer. > >