2015-05-15 9:10 GMT+02:00 Richard Wordingham < [email protected]>:
> On Fri, 15 May 2015 02:10:36 +0200 > Philippe Verdy <[email protected]> wrote: > > > 2015-05-14 20:13 GMT+02:00 Richard Wordingham < > > [email protected]>: > > > > > On Thu, 14 May 2015 12:58:29 +0200 > > > Philippe Verdy <[email protected]> wrote: > > > > > > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham < > > > > [email protected]>: > > > > > > > > > An elegant formal solution to the Kleene star problem interprets > > > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > > > > > counter-intuitive > > > > > > The technical term for this is the 'concurrent iteration' - or at > > > least, that's the term used in the 'Book of Traces'. > > > > > > > For your example "(\u0323\u0302)*" the characters in the > > > > alternatives (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), > > > > once converted to NFD (which is the same here) are just using at > > > > most two distinct non-zero combining classes and no blocker; so > > > > it is safe to trasform it to (\u0323|\u0302)* for a first pass > > > > matching that will then only check candidate matchs in the second > > > > pass. or more efficiently, a second finite state automata (FSA) > > > > running in parallel with its own state: > > > > > > You've forgotten the basic problem. A *finite* state automaton > > > cannot count very far; with only n states, it cannot count as far > > > as n. > > > > > > > I did not forget it, this is why there's a second pass (or a second > > FSA running in parallel to indicate its own accept state). You have > > to combine the two states variables to get the final combined state > > to determine if it is a final accept state. > > Your description makes no sense to me as a description of a finite > state automaton. This is because you don't understand the issue ! > Now, a program to check whether a trace matching > {\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just > counts the number of times \u0323 occurs and the number of times > \u0302 occurs, and returns whether they are equal. This is wrong. \0323\0323\0302\0302 and \0323\0302\0323\0302 would pass your counting test (which does not work in a FSA) but they are NOT canonically equivalent because the identical combining characters are blocking each other (so arbitrary ordering is not possible). I maintain what I said: you don't need arbitrary counting and a FSA is possible (both NFA, using compound states, and the derived DFA if ever you want to resolve compound states to a single integer, but assume the fact the the transition tables will explode dramatically) Once again we cannot have pairs of strings where you cannot isolate BOUNDED substrings (between blockers) where you can check their canonically equivalence. At most you'll have only 255 combining characters to check that have distinct non-zero combining classes. So the FSA implementation is perfectly possible, for canonical equivalences only... This evidently does not work if you are performing regexp searches using looser equivalences, such as compatibility equivalence.

