Dennis Sweeney <[email protected]> added the comment:
> Offhand do you know what the _best_ timing for two-way search is in a
> pattern-not-found case?
Looking at the glibc implementation, in the top-level "else" clause
(for when the string isn't completely periodic),
we have:
period = MAX (suffix, needle_len - suffix) + 1;
so period > needle_len / 2.
Then during each iteration of the j-loop (where j is the index into the
haystack),
j is sometimes incremented by `period`, which would probably give O(n/m) best
case.
Here's a sublinear example for the two-way algorithm:
N = 10**7
haystack = "ABC" * N
needle1 = "ABC" * (N // 10) + "DDD"
needle2 = "ABC" * 10 + "DDD"
=== Results ===
Pure Python Two-Way, shorter needle: 1.7142754
Pure Python Two-Way. Longer needle: 0.0018845000000000667
Python builtin, shorter needle: 0.031071400000000082
Python builtin, longer needle: 0.030566099999999707
This case is surprisingly better than the builtin:
N = 10**7
haystack = "ABC" * N
needle1 = "DDD" + "ABC" * (N // 10)
needle2 = "DDD" + "ABC" * 10
=== Results ===
Pure Python Two-Way, shorter needle: 0.0020895000000000774
Pure Python Two-Way. Longer needle: 0.0017878999999999534
Python builtin, shorter needle: 0.029998100000000028
Python builtin, longer needle: 0.02963439999999995
This was measured using the attached fastsearch.py. There are some
manipulations in there like string slicing that would make more sense as
pointer math in C, so especially accounting for that, I'm guessing it would be
pretty competitive in a lot of cases.
A lot of the implementations look like they use a complete Boyer-Moore table to
go sublinear in even more cases, which it seems we want to avoid. I'm not
certain, but it's conceivable that the existing techniques of using just one
one delta-value or the "Bloom filter" could be thrown in there to get some of
the same improvements. Although that could be redundant -- I'm not sure.
----------
Added file: https://bugs.python.org/file49507/fastsearch.py
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue41972>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com