Here's my attempt at some heuristic motivation:

Try to construct a needle that will perform as poorly as possible when
using the naive two-nested-for-loops algorithm. You'll find that if
there isn't some sort of vague periodicity in your needle, then you
won't ever get *that* unlucky; each particular alignment will fail
early, and if it doesn't then some future alignment would be pigeonholed 
to fail early.

So Crochemore and Perrin's algorithm explicitly handles this "worst case"
of periodic strings. Once we've identified in the haystack some period
from the needle, there's no need to re-match it. We can keep a memory
of how many periods we currently remember matching up, and never re-match
them. This is what gives the O(n) behavior for periodic strings.

But wait! There are some bad needles that aren't quite periodic.
For instance:

    >>> 'ABCABCAABCABC' in 'ABC'*1_000_000

The key insight though is that the worst strings are still
"periodic enough", and if we have two different patterns going on,
then we can intentionally split them apart. For example,
`"xyxyxyxyabcabc" --> "xyxyxyxy" + "abcabc"`. I believe the goal is to
line it up so that if the right half matches but not the left then we
can be sure to skip somewhat far ahead. This might not correspond
exactly with splitting up two patterns. This is glossing over some
details that I'm admittedly still a little hazy on as well, but
hopefully that gives at least a nudge of intuition.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/MXMS5XIV6WJFFRHTH7TBHAO3TC4QIHBZ/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to