2015-05-14 20:13 GMT+02:00 Richard Wordingham < [email protected]>:
> On Thu, 14 May 2015 12:58:29 +0200 > Philippe Verdy <[email protected]> wrote: > > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham < > > [email protected]>: > > > > > An elegant formal solution to the Kleene star problem interprets > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > > > counter-intuitive > > The technical term for this is the 'concurrent iteration' - or at > least, that's the term used in the 'Book of Traces'. > > > For your example "(\u0323\u0302)*" the characters in the alternatives > > (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), once converted to > > NFD (which is the same here) are just using at most two distinct > > non-zero combining classes and no blocker; so it is safe to trasform > > it to (\u0323|\u0302)* for a first pass matching that will then only > > check candidate matchs in the second pass. or more efficiently, a > > second finite state automata (FSA) running in parallel with its own > > state: > > You've forgotten the basic problem. A *finite* state automaton cannot > count very far; with only n states, it cannot count as far as n. > I did not forget it, this is why there's a second pass (or a second FSA running in parallel to indicate its own accept state). You have to combine the two states variables to get the final combined state to determine if it is a final accept state. But one of the two state variable has an upper bound which is not only finite but very small (it has at most 255 possible values). Typical regexp engines do not create the full deterministic automata with all its states (it would require a lot of memory due to combinatorial effects, they handle multiple state variables in parallel and use a rendez-vous system to test them in order to determine if we have an accept state, or a fail state (for which we must rollback). So even if one of the state is not bound in terms of length, the other one (exploring the possible lengths of reorderable of non-blocking combining characters) is clearly finite (so you don't need to count very far. So you just need a single additional byte of storage for storing the second state variable in the global state of your FSA. The size of the global state variable only depends on the number of alternatives in your regexp and it is also bound (limited to the length of the source regexp: even if if your regexp speicification string is 1000 characters long, you know that you will never need more than 1000 bytes to represent it, but of course it will not be a simple 32-bit integer: this structure can represent billions of billions of billions of possible states without needing to transform the FSA to a pure deterministic FSA with a single integer and without having to build a single MxN transition matrix (with M columns for each possible character class, and N rows for each each deterministic state, where each cell contains the value of for next deterministic state: this will not work) Even if your regexp is so complex that it requires a specification string that is 100KB long, your global state variable will never be longer than 100KB. But of course, this structure is a bit less easier to use when advancing: you have to advance all active states in parallel using the current input character with each transition submatrix (which is really small as well with just a couple of elements that can fit in a small structure with fixed size: an accept character, limited to 21 bits with Unicode, or a character class index, and a few flags, 3 bits, for saying if this character is advancable, or if the current state is an accept state or a failure state, or to indicate the presence of an alternative and give an index to its own branch by specifying only which elementary state variable is used for that alternative within the structure of the global state variable). In summary the global state varaible is just a small array of 32 bit integers for the most complex regexps you will encounter (I don't think that 100KB regexps are very common, almost of them are below 1KB, so your global state variable will fit in 4KB of memory, and the transition matrix will also fit in 4KB).

