Philippe and I have got bogged down in a long discussion of how to parse traces of Unicode strings under canonical equivalence against non-regular Kleene star of regular expressions. Fortunately, such expressions can be expected to have very little use. A seemingly simple example is the regex \u0f73* i.e. any number of occurrences of U+0F73 TIBETAN VOWEL SIGN II, and not \u0f71\u0f72*. An example of a string matching under canonical equivalence is 0F71 0F71 0F72 0F72.
I believe we both thought that characters would arrive from the trace in a deterministic order. Now, many regular expression engines back-track their parsing of the input string (no-one has reported working with input traces). A possibly useful trick would be for characters to be taken from the input file in accordance with the matching to the pattern, with input also back-tracked if matching fails. The notion of next character would depend on the state of the parsing algorithm. In the example above, the engine would just take the input in the order 0F71 0F72 0F71 0F72. Match found, job done. One advantage of this scheme is that there would be no need for adjustments to deal with the interleaving of adjacent matches to successive subexpressions. There would be no nagging worry that one's rational expression was not a regular expression when applied to traces. Any theoreticians around may be wondering how this magic is achieved. The simple answer is that the non-finiteness has been transferred to: (1) the back-tracking through parse options; and (2) the algorithm to walk through the character sequencing options. The algorithm itself should be tractable - Mark Davis has published an algorithm to generate all strings canonically equivalent to a Unicode string, and what we need might not be so complex. I offer this thought up as it seems that, for a regex engine working on traces with deterministic input, the byte code for a regex concatenation AB or iteration A* is much more complicated than the code for the subregexes A and B. I have a worry that the length of the compiled code might even be exponential with the length of the regex. (I may be wrong - there might be a limit to what one can do for worst case complexity of the interleaving.) Choosing the input to match the regex would remove this problem. Richard.

