On Wed., Oct. 14, 2020, 17:37 Tim Peters, <tim.pet...@gmail.com> wrote:
> [Steven D'Aprano <st...@pearwood.info>]
> > Perhaps this is a silly suggestion, but could we offer this as an
> > external function in the stdlib rather than a string method?
> >
> > Leave it up to the user to decide whether or not their data best suits
> > the find method or the new search function. It sounds like we can offer
> > some rough heuristics, but the only way to really know is "try it and
> > see which works best for you".
> >
> > The `string` module is an obvious place for it.
>
> I think this is premature. There is almost never an optimization
> that's a pure win in all cases. For example, on some platforms
> `timsort` will never be as fast as the old samplesort in cases with a
> very large number of equal elements, and on all platforms `timsort`
> consumes more memory. And it's a wash on "random" data on most
> platforms. Nevertheless, it was a significant speed win for many -
> possibly even most - real-life cases.
>
> So far, the PR reduces the runtime in the bug report from about 3 1/2
> hours to under a tenth of a second. It would be crazy not to gift
> users such dramatic relief by default, unless there are good reasons
> not to. Too soon to say. On tests of random strings with characters
> following a Zipf distribution (more realistic than uniform, but still
> very easy to generate - and not contrived in any way to favor either
> approach), the current new code is usually faster than the status quo,
> in no case has been worse than twice as slow, and in a number of cases
> has been 20x faster. It's already clearly a win _overall_.
>
> The bulk of the "but it's slower" cases now are found with very short
> needles (patterns), which was expected (read my comments on the bug
> report), and exacerbated by the fact that the structure of the random
> test generation is quite likely to create cases where a short needle
> is found early in the haystack. This combination makes the higher
> preprocessing overhead as damaging as possible.
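[An aside on the Zipf test setup described above: this is a hypothetical sketch, not the actual benchmark code from the PR, of how such strings can be generated with only the standard library. The alphabet, seed, exponent, and sizes are arbitrary choices for illustration.]

```python
import random

def zipf_string(alphabet, length, s=1.0, rng=random):
    """Random string whose k-th alphabet symbol has weight 1 / k**s."""
    weights = [1.0 / (k ** s) for k in range(1, len(alphabet) + 1)]
    return "".join(rng.choices(alphabet, weights=weights, k=length))

rng = random.Random(12345)  # fixed seed so runs are reproducible
alphabet = "abcdefghijklmnopqrstuvwxyz"
haystack = zipf_string(alphabet, 10_000, rng=rng)
needle = zipf_string(alphabet, 8, rng=rng)
# Early letters dominate both strings, so a short needle tends to be
# found early in the haystack - exactly the regime where preprocessing
# overhead is relatively most damaging.
```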
> Dennis also expanded the current code's "32-bit bitmask" trick (which
> is an extreme simplification of Daniel Sunday's algorithm) to a
> fixed-size 256-byte table of character-class-specific skip counts,
> which went a _very_ long way toward eliminating the cases where the
> current code enjoyed a "pure luck" advantage over the new code. But
> that increased the preprocessing overhead again, which again is
> relatively most significant for short needles - those 256 bytes have
> to be initialized every time, no matter the lengths of the needle or
> haystack.
>
> If the new code were changed to exempt short needles, even now it
> might be hard to make an objective case not to adopt it.

This is where it sounds like we might want to go. If we can figure out
a reasonable, cheap heuristic to decide when a needle is "too short"
-- for either the search string or the string to search -- to benefit
from the fancier algorithm, then I would support adding the fancy
string search.

-Brett

> But it would leave open a different idea for an "opt-in" facility:
> one that allowed giving a needle to a function and getting back an
> object capturing the results of preprocessing, plus a different
> wrapper around the search code that accepted the already-preprocessed
> info. There are, e.g., certainly cases where repeated searches for a
> given 4-character needle could be sped up significantly by exploiting
> the new code, but where the overhead of preprocessing is too much to
> bear in "random" performance testing. It would also open the door to
> more aggressive (expensive) kinds of preprocessing. That's where
> users may be able to make useful, informed judgments.
>
> [David Mertz]
> > That said, I do recognize that the `re` module also has pathological
> > cases, and the standard library documentation explicitly says "maybe
> > you want to try the third-party `regex` implementation." That is
> > sort of precedent for this approach.
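[For readers unfamiliar with the table Tim mentions above: a Sunday-style search keeps, for each of the 256 possible byte values, how far the window may shift when that value appears just past the current window. This is an illustrative sketch of the general technique, not Dennis's actual C code.]

```python
def sunday_search(haystack: bytes, needle: bytes) -> int:
    """Find needle in haystack using Sunday's bad-character rule.
    Returns the first match index, or -1. Illustrative only."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # 256-entry skip table: bytes absent from the needle let the window
    # jump past them entirely (m + 1 positions); bytes in the needle
    # shift by the distance from their last occurrence to the window end.
    skip = [m + 1] * 256
    for j, b in enumerate(needle):
        skip[b] = m - j
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        if i + m == n:  # no character past the window; nothing to skip on
            break
        i += skip[haystack[i + m]]
    return -1
```

[Note that the table has to be filled on every call no matter how short the needle or haystack is, which is why the preprocessing cost weighs relatively most heavily on short-needle cases.]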
> `regex` also suffers exponential-time disasters; they're just harder
> to stumble into - and there are books explaining in detail how and
> why regex engines can fall into these traps.
>
> It's far less _expected_ in plain string searches, and Dennis was
> aware of the new algorithms because, apparently, (at least) glibc's
> memmem switched to one some years ago. So we're playing catch-up here.
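[As a footnote on why plain-string blowups surprise people: the quadratic worst case for the naive algorithm needs a degenerate needle/haystack pair, of the same flavor as the bug report's case. A self-contained illustration, counting character comparisons rather than wall time; this is a toy matcher, not CPython's actual implementation.]

```python
def naive_find(haystack: str, needle: str) -> tuple[int, int]:
    """Naive substring search; returns (index or -1, comparisons made)."""
    comparisons = 0
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
        else:  # inner loop finished without a mismatch: full match at i
            return i, comparisons
    return -1, comparisons

# Degenerate case: the needle almost matches at every position, so each
# window is scanned nearly to its end before failing. Doubling n roughly
# quadruples the number of comparisons.
costs = [naive_find("a" * n, "a" * (n // 2) + "b")[1]
         for n in (100, 200, 400)]
```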
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/M5BSPBIPE2A7FBIOIBKOZHPKXQPR37XD/
Code of Conduct: http://python.org/psf/codeofconduct/