[Python-Dev] Re: Changing Python's string search algorithms

2020-10-13 Thread Larry Hastings
On 10/13/20 9:54 AM, Tim Peters wrote: Due to the natures of the old and new algorithms, neither will be faster or slower in all cases, the newer one should never be dramatically slower than the old one, the new one will be spectacularly faster in some cases, and "in general" it seems impossible

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-13 Thread David Mertz
I'm sure that the large majority of the string searches I've done are in Larry's tiny category. However, I also think that big data needs are increasing, and things like FASTA files can be enormously large texts that one has good reasons to search on. If there is a heuristic switch between algori

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Tim Peters
Rest assured that Dennis is aware of that pragmatics may change for shorter needles. The code has always made a special-case of 1-character needles, because it's impossible "even in theory" to improve over straightforward brute force search then. Say the length of the text to search is `t`, and t

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Guido van Rossum
Maybe someone reading this can finish the Wikipedia page on Two-Way Search? The code example trails off with a function with some incomprehensible remarks and then a TODO... On Wed, Oct 14, 2020 at 9:07 AM Tim Peters wrote: > Rest assured that Dennis is aware of that pragmatics may change for >

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Tim Peters
[Guido] > Maybe someone reading this can finish the Wikipedia page on > Two-Way Search? The code example trails off with a function with > some incomprehensible remarks and then a TODO.. Yes, the Wikipedia page is worse than useless in its current state, although some of the references it lists ar

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Tal Einat
On Wed, Oct 14, 2020 at 7:57 PM Tim Peters wrote: > > [Guido] > > Maybe someone reading this can finish the Wikipedia page on > > Two-Way Search? The code example trails off with a function with > > some incomprehensible remarks and then a TODO.. > > Yes, the Wikipedia page is worse than useless i

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Steven D'Aprano
Perhaps this is a silly suggestion, but could we offer this as an external function in the stdlib rather than a string method? Leave it up to the user to decide whether or not their data best suits the find method or the new search function. It sounds like we can offer some rough heuristics, bu

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread David Mertz
On Wed, Oct 14, 2020 at 7:45 PM Steven D'Aprano wrote: > Perhaps this is a silly suggestion, but could we offer this as an > external function in the stdlib rather than a string method? > That feels unworkable to me. For one thing, the 'in' operator hits this same issue, doesn't it? But for ano

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Tim Peters
[Steven D'Aprano ] > Perhaps this is a silly suggestion, but could we offer this as an > external function in the stdlib rather than a string method? > > Leave it up to the user to decide whether or not their data best suits > the find method or the new search function. It sounds like we can offer

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Chris Angelico
On Thu, Oct 15, 2020 at 11:38 AM Tim Peters wrote: > I think this is premature. There is almost never an optimization > that's a pure win in all cases. For example, on some platforms > `timsort` will never be as fast as the old samplesort in cases with a > very large number of equal elements, and

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Guido van Rossum
On Wed, Oct 14, 2020 at 9:56 AM Tim Peters wrote: > [Guido] > > Maybe someone reading this can finish the Wikipedia page on > > Two-Way Search? The code example trails off with a function with > > some incomprehensible remarks and then a TODO.. > > Yes, the Wikipedia page is worse than useless in

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Brett Cannon
On Wed., Oct. 14, 2020, 17:37 Tim Peters, wrote: > [Steven D'Aprano ] > > Perhaps this is a silly suggestion, but could we offer this as an > > external function in the stdlib rather than a string method? > > > > Leave it up to the user to decide whether or not their data best suits > > the find

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Tim Peters
[Guido] > The key seems to be: Except none of that quoted text (which I'll skip repeating) gives the slightest clue as to _why_ it may be an improvement. So you split the needle into two pieces. So what? What's the _point_? Why would someone even imagine that might help? Why is one half then

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-14 Thread Greg Ewing
On 15/10/20 1:45 pm, Chris Angelico wrote: So it'd be heuristics in the core language that choose a good default for most situations, and then a str method that returns a preprocessed needle. Or maybe cache the results of the preprocessing? -- Greg _

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-15 Thread Dennis Sweeney
Here's my attempt at some heuristic motivation: Try to construct a needle that will perform as poorly as possible when using the naive two-nested-for-loops algorithm. You'll find that if there isn't some sort of vague periodicity in your needle, then you won't ever get *that* unlucky; each particu

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-15 Thread Tim Peters
[Dennis Sweeney ] > Here's my attempt at some heuristic motivation: Thanks, Dennis! It helps. One gloss: > > The key insight though is that the worst strings are still > "periodic enough", and if we have two different patterns going on, > then we can intentionally split them apart. The am

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-15 Thread Tim Peters
[Guido] > I am not able to dream up any hard cases -- like other posters, > my own use of substring search is usually looking for a short > string in a relatively short piece of text. I doubt even the current > optimizations matter to my uses. I should have responded to this part differently. W

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-16 Thread Marco Sulla
Excuse me if I intrude in an algorithm that I have not understood, but the new optimization can be applied to regexps too? ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-16 Thread Tim Peters
[Marco Sulla] > Excuse me if I intrude in an algorithm that I have not understood, but > the new optimization can be applied to regexps too? The algorithm is limited to searching for fixed strings. However, _part_ of our regexp implementation (the bit that looks ahead for a fixed string) will inh

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-16 Thread Tim Peters
I don't plan on making a series of these posts, just this one, to give people _some_ insight into why the new algorithm gets systematic benefits the current algorithm can't. It splits the needle into two pieces, u and v, very carefully selected by subtle linear-time needle preprocessing (and it's

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-16 Thread Chris Angelico
On Sat, Oct 17, 2020 at 12:30 PM Tim Peters wrote: > > I don't plan on making a series of these posts, just this one, to give > people _some_ insight into why the new algorithm gets systematic > benefits the current algorithm can't. It splits the needle into two > pieces, u and v, very carefully

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-16 Thread Tim Peters
[Tim Peters, explains one of the new algorithm's surprisingly effective moving parts] [Chris Angelico ] > Thank you, great explanation. Can this be added to the source code > if/when this algorithm gets implemented? No ;-) While I enjoy trying to make hard things clear(er), I need to understand

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-17 Thread Antoine Pitrou
On Fri, 16 Oct 2020 20:24:22 -0500 Tim Peters wrote: > > Note that no "extra" storage is needed to exploit this. No character > lookups, no extra expenses in time or space of any kind. Just "if we > mismatch on the k'th try, we can jump ahead k positions". Ok, so that means that on a N-characte

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-17 Thread Tim Peters
[Tim] >> Note that no "extra" storage is needed to exploit this. No character >> lookups, no extra expenses in time or space of any kind. Just "if we >> mismatch on the k'th try, we can jump ahead k positions". [Antoine Pitrou ] > Ok, so that means that on a N-character haystack, it'll always do

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-17 Thread Tim Peters
[Tim] > ... > Alas, the higher preprocessing costs leave the current PR slower in "too > many" cases too, especially when the needle is short and found early > in the haystack. Then any preprocessing cost approaches a pure waste > of time. But that was this morning. Since then, Dennis changed the

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-18 Thread Greg Ewing
On 17/10/20 3:26 pm, Tim Peters wrote: Tal Einat posted earlier that he was keen to try to work up a clear explanation, and I look forward to that. All the expositions I've found of this algorithm so far are hard to approach. Maybe Mathologer or 3blue1brown could be persuaded to help? They seem

[Python-Dev] Re: Changing Python's string search algorithms

2020-10-18 Thread Raymond Hettinger
> On Oct 17, 2020, at 2:40 PM, Tim Peters wrote: > > Still waiting for someone who thinks string search speed is critical > in their real app to give it a try. In the absence of that, I endorse > merging this. Be bold. Merge it. :-) Raymond __