On Wed., Oct. 14, 2020, 17:37 Tim Peters, <tim.pet...@gmail.com> wrote:
> [Steven D'Aprano <st...@pearwood.info>]
> > Perhaps this is a silly suggestion, but could we offer this as an
> > external function in the stdlib rather than a string method?
> >
> > Leave it up to the user to decide whether or not their data best suits
> > the find method or the new search function. It sounds like we can offer
> > some rough heuristics, but the only way to really know is "try it and
> > see which works best for you".
> >
> > The `string` module is an obvious place for it.
>
> I think this is premature. There is almost never an optimization
> that's a pure win in all cases. For example, on some platforms
> `timsort` will never be as fast as the old samplesort in cases with a
> very large number of equal elements, and on all platforms `timsort`
> consumes more memory. And it's a wash on "random" data on most
> platforms. Nevertheless, it was a significant speed win for many -
> possibly even most - real-life cases.
>
> So far, the PR reduces the runtime in the bug report from about 3 1/2
> hours to under a tenth of a second. It would be crazy not to gift
> users such dramatic relief by default, unless there are good reasons
> not to. Too soon to say. On tests of random strings with characters
> following a Zipf distribution (more realistic than uniform, but still
> very easy to generate - and not contrived in any way to favor either
> approach), the current new code is usually faster than the status quo,
> in no case has been worse than twice as slow, and in a number of cases
> has been 20x faster. It's already clearly a win _overall_.
>
> The bulk of the "but it's slower" cases now are found with very short
> needles (patterns), which was expected (read my comments on the bug
> report), and exacerbated by the fact that the structure of the random
> test generation is quite likely to create cases where a short needle
> is found early in the haystack. This combination makes the higher
> preprocessing overhead as damaging as possible.
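[An aside on the Zipf test setup described above: this is a hypothetical sketch, not the actual benchmark code from the PR, of how such strings can be generated with only the standard library. The alphabet, seed, exponent, and sizes are arbitrary choices for illustration.]

```python
import random

def zipf_string(alphabet, length, s=1.0, rng=random):
    """Random string whose k-th alphabet symbol has weight 1 / k**s."""
    weights = [1.0 / (k ** s) for k in range(1, len(alphabet) + 1)]
    return "".join(rng.choices(alphabet, weights=weights, k=length))

rng = random.Random(12345)  # fixed seed so runs are reproducible
alphabet = "abcdefghijklmnopqrstuvwxyz"
haystack = zipf_string(alphabet, 10_000, rng=rng)
needle = zipf_string(alphabet, 8, rng=rng)
# Early letters dominate both strings, so a short needle tends to be
# found early in the haystack - exactly the regime where preprocessing
# overhead is relatively most damaging.
```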
> Dennis also expanded the current code's "32-bit bitmask" trick (which
> is an extreme simplification of Daniel Sunday's algorithm) to a
> fixed-size 256-byte table of character-class-specific skip counts,
> which went a _very_ long way toward eliminating the cases where the
> current code enjoyed a "pure luck" advantage over the new code. But
> that increased the preprocessing overhead again, which again is
> relatively most significant for short needles - those 256 bytes have
> to be initialized every time, no matter the lengths of the needle or
> haystack.
>
> If the new code were changed to exempt short needles, even now it
> might be hard to make an objective case not to adopt it.

This is where it sounds like we might want to go. If we can figure out
a reasonable, cheap heuristic to decide when a needle is "too short"
-- for either the search string or the string to search -- to benefit
from the fancier algorithm, then I would support adding the fancy
string search.

-Brett

> But it would leave open a different idea for an "opt-in" facility:
> one that allowed giving a needle to a function and getting back an
> object capturing the results of preprocessing, plus a different
> wrapper around the search code that accepted the already-preprocessed
> info. There are, e.g., certainly cases where repeated searches for a
> given 4-character needle could be sped up significantly by exploiting
> the new code, but where the overhead of preprocessing is too much to
> bear in "random" performance testing. It would also open the door to
> more aggressive (expensive) kinds of preprocessing. That's where
> users may be able to make useful, informed judgments.
>
> [David Mertz]
> > That said, I do recognize that the `re` module also has pathological
> > cases, and the standard library documentation explicitly says "maybe
> > you want to try the third-party `regex` implementation." That is
> > sort of precedent for this approach.
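[For readers unfamiliar with the table Tim mentions above: a Sunday-style search keeps, for each of the 256 possible byte values, how far the window may shift when that value appears just past the current window. This is an illustrative sketch of the general technique, not Dennis's actual C code.]

```python
def sunday_search(haystack: bytes, needle: bytes) -> int:
    """Find needle in haystack using Sunday's bad-character rule.
    Returns the first match index, or -1. Illustrative only."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # 256-entry skip table: bytes absent from the needle let the window
    # jump past them entirely (m + 1 positions); bytes in the needle
    # shift by the distance from their last occurrence to the window end.
    skip = [m + 1] * 256
    for j, b in enumerate(needle):
        skip[b] = m - j
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        if i + m == n:  # no character past the window; nothing to skip on
            break
        i += skip[haystack[i + m]]
    return -1
```

[Note that the table has to be filled on every call no matter how short the needle or haystack is, which is why the preprocessing cost weighs relatively most heavily on short-needle cases.]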
> `regex` also suffers exponential-time disasters; they're just harder
> to stumble into - and there are books explaining in detail how and
> why regex engines can fall into these traps.
>
> It's far less _expected_ in plain string searches, and Dennis was
> aware of the new algorithms because, apparently, (at least) glibc's
> memmem switched to one some years ago. So we're playing catch-up here.
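[As a footnote on why plain-string blowups surprise people: the quadratic worst case for the naive algorithm needs a degenerate needle/haystack pair, of the same flavor as the bug report's case. A self-contained illustration, counting character comparisons rather than wall time; this is a toy matcher, not CPython's actual implementation.]

```python
def naive_find(haystack: str, needle: str) -> tuple[int, int]:
    """Naive substring search; returns (index or -1, comparisons made)."""
    comparisons = 0
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
        else:  # inner loop finished without a mismatch: full match at i
            return i, comparisons
    return -1, comparisons

# Degenerate case: the needle almost matches at every position, so each
# window is scanned nearly to its end before failing. Doubling n roughly
# quadruples the number of comparisons.
costs = [naive_find("a" * n, "a" * (n // 2) + "b")[1]
         for n in (100, 200, 400)]
```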
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/M5BSPBIPE2A7FBIOIBKOZHPKXQPR37XD/
Code of Conduct: http://python.org/psf/codeofconduct/