Maybe someone reading this can finish the Wikipedia page on Two-Way Search? The code example trails off with a function with some incomprehensible remarks and then a TODO...
On Wed, Oct 14, 2020 at 9:07 AM Tim Peters <tim.pet...@gmail.com> wrote: > Rest assured that Dennis is aware of that pragmatics may change for > shorter needles. > > The code has always made a special-case of 1-character needles, > because it's impossible "even in theory" to improve over > straightforward brute force search then. > > Say the length of the text to search is `t`, and the length of the > pattern `p`. Brute force and the current code have worst case O(t*p) > behavior. The new code, worst case O(t+p). If that's as deep as you > dig into it, it seems all but obvious then that O(t*p) just can't be > that bad when p is small, so keep it simple. > > But there's another twist: the current and new code both have O(t/p) > cases too (but brute force, and even Knuth-Morris-Pratt, don't). That > can be highly beneficial even for p as small as 3. > > Unfortunately, the exact cases in which the current and new code enjoy > O(t/p) behavior aren't the same. > > Lying a bit: In general the current code has just two tricks. One of > those tricks is useless (pure waste of time) if the pattern happens to > end with a repeated character pair, and is most effective if the last > character appears nowhere else in the pattern. The other trick is most > effective if the set of characters in the text has no intersection > with the set of characters in the pattern (note that the search is > certain to fail then), but useless if the set of text characters is a > subset of the set of pattern characters (as, e.g., it very often is in > real-life searches in apps with a small alphabet, like [ACGT}+ for > genomes). > > But I don't know how to characterize when the new code gets maximum > benefit. It's not based on intuitive tricks. The paper that > introduced it[1] says it's based on "a deep theorem on words > known as the Critical Factorization Theorem due to Cesari, Duval, > Vincent, and Lothaire", and I still don't fully understand why it > works. > > It's clear enough, though, that the new code gets real systematic > value out of patterns with repetitions (like "abab"), where the > current code gets real value from those only by accident (e.g., "a" > and "b" happen to appear rarely in the text being searched, in which > case that the pattern has repetitions is irrelevant). > > But, as I said in the first message, the PR is "preliminary". There > are still worlds of tweaks that have been thought of but not yet tried > even once. > > [1] search for "Two-Way String Matching" by Crochemore and Perrin. > _______________________________________________ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/G53VXXYWWEM26S2XKVX5W6P54R47YQTG/ > Code of Conduct: http://python.org/psf/codeofconduct/ > -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
_______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NHG5YVVVZBQSEZBSCDVLETDTFSHBMKBV/ Code of Conduct: http://python.org/psf/codeofconduct/