Dennis Sweeney <[email protected]> added the comment:
In "Fast String Searching" (1991), A. Hume and D. Sunday emphasize that most of
the runtime of string-search algorithms occurs in a code path that skips over
immediately-disqualified alignments. As such, that paper recommends extracting
a hot "Horspool" code path and keeping it small. For that reason, I'm posting a
PR which boils down that fast path to the following:
while (skip > 0 && window_last < haystack_end) {
window_last += skip;
skip = table[(*window_last) & 63];
}
This uses two memory accesses, whereas the last implementation used three (the
extra to get window[cut]). This requires the skip table to change slightly,
since it is indexed by the last character in the current haystack window rather
than one character later ("fast" versus "sd1" in the paper). The paper also
recommends unrolling this loop 3x, but my benchmarks found no benefit to
unrolling.
The PR also refactors some of the main fastsearch code into separate functions,
and computes and then uses a similar gap/skip integer used already in the
default implementation (Hume and Sunday call this "md2"). It retains a
linear-time constant space worst-case with the Two-Way algorithm.
There are a bunch of benchmarks here, both on Zipf-distributed random strings
and on a few c/rst/python/binary files in the cpython repo. They compare the
current main branch to the PR:
https://gist.github.com/sweeneyde/fc20ed5e72dc6b0f3b41c0abaf4ec3be
Summary of those results:
On the Zipf strings:
* 83 cases slower (at most 1.59x slower)
* 654 cases faster (up to 4.49x faster)
* Geometric mean: 1.10x faster
On the "Real world" source files:
* 76 cases slower (at most 2.41x slower)
* 420 cases faster (up to 18.12x faster)
* Geometric mean: 1.33x faster
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue41972>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com