On Mon, 29 Jan 2018 14:15:04 +0100 Philippe Verdy via Unicode <unicode@unicode.org> wrote:
> No since the begining we were talking about matching strings that are > canonically equivalent within regexps. So that searching for a regexp > containing precombined characters or decomposed characters would find > them independantly of the encoded form (normalized or not) of the > input and independantly that there are addtional combining characters > inserted between them. OK, we're taking different approaches. Given finite automata for recognising NFD strings matching regular languages A and B of traces, I know how to construct a non-deterministic finite automaton for recognising any of AB, A and B, A or B, and, if itself a regular language, A*, where the sets denoted are traces. For AB, the states I keep track of are states of A, states of B, and A × 255 × B, where the 2nd coordinate of the latter is the ccc of the latest element of the searched string used to propagate a state of B. If I didn't normalise the searched string, I'd have to keep a list of ccc's of characters used in propagating states of B. That gets complicated with A*, for which in theory I need the simultaneous progression of 255 FSMs (OK, only about 52 at the moment). I actually treat A* as A(A*), where no capture is implied by the parentheses. Theory says that if NFD([A]) is a regular language (where [x] is the set of strings in the canonical equivalence class x), then A is a regular language. However, constructing a finite automaton to check for matches may not be straightforward. I prefer to sacrifice the purity of finiteness by buffering enough of the searched string to convert it to NFD on the fly. As an example of the complexity, consider checking whether a string was composed of pairs of identical combining characters. > The case of u with diaeresis and macron is simpler: it has two > combining characters of the same combining class and they don't > commute, still the regexp to match it is something like: > > U [[:cc>0:]-[:cc=above:]]* <DIAERESIS> [[:cc>0:]-[:cc=above:]]* > <MACRON> [[:cc>0:]-[:cc=above:]]* <U WITH DIAERESIS AND MACRON> was meant to be an example of a searched string. For example, <U WITH DIARESIS AND MACRON, COMBINING DOT BELOW> contains, under canonical equivalence, the substring <U WITH DIAERESIS, COMBINING DOT BELOW>. Your regular expressions would not detect this relationship. Richard.