I wrote my previous message before reading this.  Thank you for the test you 
ran -- it answers the performance question.  You show that re.finditer() is 
about 30x faster, which certainly recommends it over a simple loop with its 
per-character Python-level overhead.  


Feb 28, 2023, 05:44 by li...@tompassin.net:

> On 2/28/2023 4:33 AM, Roel Schroeven wrote:
>
>> Op 28/02/2023 om 3:44 schreef Thomas Passin:
>>
>>> On 2/27/2023 9:16 PM, avi.e.gr...@gmail.com wrote:
>>>
>>>> And, just for fun, since there is nothing wrong with your code, this minor 
>>>> change is terser:
>>>>
>>>> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>>> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
>>>> ...     print(match.start(), match.end())
>>>> ...
>>>> 4 18
>>>> 26 40
>>>>
>>>
>>> Just for more fun :) -
>>>
>>> Without knowing how general your expressions will be, I think the following 
>>> version is very readable, certainly more readable than regexes:
>>>
>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> KEY = 'abc_degree + 1'
>>>
>>> for i in range(len(example)):
>>>     if example[i:].startswith(KEY):
>>>         print(i, i + len(KEY))
>>> # prints:
>>> 4 18
>>> 26 40
>>>
>> I think it's often a good idea to use a standard library function instead of 
>> rolling your own. The issue becomes less clear-cut when the standard library 
>> doesn't do exactly what you need (as here, where re.finditer() uses regular 
>> expressions while the use case only uses simple search strings). Ideally 
>> there would be a str.finditer() method we could use, but in the absence of 
>> that I think we still need to consider using the almost-but-not-quite 
>> fitting re.finditer().
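
[Inline: such a str.finditer() is easy to sketch on top of str.find, which
also runs in C.  The name str_finditer below is made up for illustration;
it yields non-overlapping spans, like re.finditer:]

```python
def str_finditer(key, text):
    # Hypothetical str.finditer(): yield (start, end) spans of key in text,
    # non-overlapping like re.finditer, but built on plain str.find.
    if not key:
        return  # avoid an infinite loop on the empty key
    start = text.find(key)
    while start != -1:
        end = start + len(key)
        yield start, end
        start = text.find(key, end)

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(list(str_finditer('abc_degree + 1', example)))
# prints [(4, 18), (26, 40)], matching the re.finditer output above
```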
>>
>> Two reasons:
>>
>> (1) I think it's clearer: the name tells us what it does (though of course 
>> we could solve this in a hand-written version by wrapping it in a suitably 
>> named function).
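
[Inline: that wrapping might look like the sketch below -- find_all_spans
is a made-up name chosen so the call site reads clearly:]

```python
import re

def find_all_spans(key, text):
    # Suitably named wrapper: the name says what it does,
    # re.finditer does the work.  key is escaped, so it is
    # treated as a literal string, not a regex.
    return [(m.start(), m.end()) for m in re.finditer(re.escape(key), text)]

print(find_all_spans('abc_degree + 1', 'X - abc_degree + 1 + qq + abc_degree + 1'))
# prints [(4, 18), (26, 40)]
```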
>>
>> (2) Searching for a string in another string, in a performant way, is not as 
>> simple as it first appears. Your version works correctly, but slowly. In 
>> some situations that doesn't matter, but in other cases it will. For better 
>> performance, string searching algorithms jump ahead either when they find a 
>> match or when they know for sure there can't be a match for some distance 
>> (see e.g. the Boyer–Moore string-search algorithm). You could write such a 
>> more efficient algorithm yourself, but then it becomes more complex and more 
>> error-prone. Using a well-tested existing function becomes quite attractive.
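
[Inline: for the curious, here is a minimal sketch of that jump-ahead idea,
using Horspool's simplification of Boyer–Moore.  Like the simple loop, it
reports overlapping matches; and being pure Python, it will still lose to
re.finditer/str.find, which run in C -- it only illustrates the algorithm:]

```python
def horspool_find_all(key, text):
    # Boyer-Moore-Horspool sketch: for each character of key except the
    # last, precompute how far the search window may jump when that
    # character ends the current window; any other character lets the
    # window jump the full key length.
    m, n = len(key), len(text)
    if m == 0 or m > n:
        return []
    shift = {key[i]: m - 1 - i for i in range(m - 1)}
    matches = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == key:
            matches.append((i, i + m))
        i += shift.get(text[i + m - 1], m)
    return matches

print(horspool_find_all('in', 'string in string'))
# prints [(3, 5), (7, 9), (13, 15)]
```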
>>
>
> Sure, it all depends on what the real task will be.  That's why I wrote 
> "Without knowing how general your expressions will be". For the example 
> string, it's unlikely that speed will be a factor, but who knows what target 
> strings and keys will turn up in the future?
>
>> To illustrate the performance difference, I did a simple test (using the 
>> paragraph above as test text):
>>
>>      import re
>>      import timeit
>>
>>      def using_re_finditer(key, text):
>>          matches = []
>>          for match in re.finditer(re.escape(key), text):
>>              matches.append((match.start(), match.end()))
>>          return matches
>>
>>
>>      def using_simple_loop(key, text):
>>          matches = []
>>          for i in range(len(text)):
>>              if text[i:].startswith(key):
>>                  matches.append((i, i + len(key)))
>>          return matches
>>
>>
>>      CORPUS = """Searching for a string in another string, in a performant way, is
>>      not as simple as it first appears. Your version works correctly, but slowly.
>>      In some situations it doesn't matter, but in other cases it will. For better
>>      performance, string searching algorithms jump ahead either when they found a
>>      match or when they know for sure there isn't a match for some time (see e.g.
>>      the Boyer–Moore string-search algorithm). You could write such a more
>>      efficient algorithm, but then it becomes more complex and more
>>      error-prone.
>>      Using a well-tested existing function becomes quite attractive."""
>>      KEY = 'in'
>>      print('using_simple_loop:', timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(), number=1000))
>>      print('using_re_finditer:', timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(), number=1000))
>>
>> This does 5 runs of 1000 repetitions each, and reports the time in seconds 
>> for each of those runs.
>> Result on my machine:
>>
>>      using_simple_loop: [0.13952950000020792, 0.13063130000000456, 0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
>>      using_re_finditer: [0.003861400000005233, 0.004061900000124297, 0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
>>
>> We find that in this test re.finditer() is more than 30 times faster 
>> (despite the overhead of regular expressions).
>>
>> While speed isn't everything in programming, with such a large difference in 
>> performance and (to me) no real disadvantages of using re.finditer(), I 
>> would prefer re.finditer() over writing my own.
>>
>
> -- 
> https://mail.python.org/mailman/listinfo/python-list
>
