On Sun, 13 Oct 2019 10:04:34 +0200 Hans Åberg via Unicode <unicode@unicode.org> wrote:
> > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode > > <unicode@unicode.org> wrote: > > > > On Sat, 12 Oct 2019 21:36:45 +0200 > > Hans Åberg via Unicode <unicode@unicode.org> wrote: > > > >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > >>> <unicode@unicode.org> wrote: > >>> > >>> But remember that 'having longer first' is meaningless for a > >>> non-deterministic finite automaton that does a single pass through > >>> the string to be searched. > >> > >> It is possible to identify all submatches deterministically in > >> linear time without backtracking — I a made an algorithm for > >> that. > > > > That's impressive, as the number of possible submatches for > > a*(a*)a* is quadratic in the string length. > > That is probably after the possibilities in the matching graph have > been expanded, which can even be exponential. As an analogy, think of > a polynomial product, I compute the product, not the expansion. I'm now beginning to wonder what you are claiming. One thing one can do without backtracking is to determine which capture groups capture something, and which combinations of capturing or not occur. That's a straightforward extension of doing the overall 'recognition' in linear time - at least, linear in length (n) of the searched string. (I say straightforward, but it would mess up my state naming algorithm.) The time can also depend on the complexity of the regular expression, which can be bounded by the length (m) of the expression if working with mere strings, giving time O(mn) if one doesn't undertake the worst case O(2^m) task of converting the non-deterministic FSM to a deterministic FSM. Using m as a complexity measure for traces may be misleading, and I think plain wrong; for moderate m, the complexity can easily go up as fast as m^10, and I think higher powers are possible. Strings exercising the higher complexities are linguistically implausible. Regards, Richard.