Maybe someone reading this can finish the Wikipedia page on Two-Way Search?
The code example trails off with a function with some incomprehensible
remarks and then a TODO...

On Wed, Oct 14, 2020 at 9:07 AM Tim Peters <tim.pet...@gmail.com> wrote:

> Rest assured that Dennis is aware of that pragmatics may change for
> shorter needles.
>
> The code has always made a special-case of 1-character needles,
> because it's impossible "even in theory" to improve over
> straightforward brute force search then.
>
> Say the length of the text to search is `t`, and the length of the
> pattern `p`. Brute force and the current code have worst case O(t*p)
> behavior. The new code, worst case O(t+p). If that's as deep as you
> dig into it, it seems all but obvious then that O(t*p) just can't be
> that bad when p is small, so keep it simple.
>
> But there's another twist:  the current and new code both have O(t/p)
> cases too (but brute force, and even Knuth-Morris-Pratt, don't). That
> can be highly beneficial even for p as small as 3.
>
> Unfortunately, the exact cases in which the current and new code enjoy
> O(t/p) behavior aren't the same.
>
> Lying a bit:  In general the current code has just two tricks. One of
> those tricks is useless (pure waste of time) if the pattern happens to
> end with a repeated character pair, and is most effective if the last
> character appears nowhere else in the pattern. The other trick is most
> effective if the set of characters in the text has no intersection
> with the set of characters in the pattern (note that the search is
> certain to fail then), but useless if the set of text characters is a
> subset of the set of pattern characters (as, e.g., it very often is in
> real-life searches in apps with a small alphabet, like [ACGT}+ for
> genomes).
>
> But I don't know how to characterize when the new code gets maximum
> benefit. It's not based on intuitive tricks.  The paper that
> introduced it[1] says it's based on "a deep theorem on words
> known as the Critical Factorization Theorem due to Cesari, Duval,
> Vincent, and Lothaire", and I still don't fully understand why it
> works.
>
> It's clear enough, though, that the new code gets real systematic
> value out of patterns with repetitions (like "abab"), where the
> current code gets real value from those only by accident (e.g., "a"
> and "b" happen to appear rarely in the text being searched, in which
> case that the pattern has repetitions is irrelevant).
>
> But, as I said in the first message, the PR is "preliminary". There
> are still worlds of tweaks that have been thought of but not yet tried
> even once.
>
> [1] search for "Two-Way String Matching" by Crochemore and Perrin.
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/G53VXXYWWEM26S2XKVX5W6P54R47YQTG/
> Code of Conduct: http://python.org/psf/codeofconduct/
>


-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/NHG5YVVVZBQSEZBSCDVLETDTFSHBMKBV/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to