RE: How to escape strings for re.finditer?

avi.e.gross Tue, 28 Feb 2023 10:34:29 -0800

Roel,

You make some good points. One thing to consider is that when you ask a regular 
expression matcher to search for something that uses NO regular expression 
features, much of the complexity disappears, and what it ends up doing is 
probably similar to a plain string search, except that the loops are 
implemented in fast functions, probably written in C.
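As a small illustration of that point (my own sketch, reusing the example 
string from this thread): re.escape() turns the key into a pattern with no 
active metacharacters, so the matcher only has to compare literal characters. 
Left unescaped, the same key is read as a regular expression in which ' +' 
means "one or more spaces", and it finds nothing.

```python
import re

example = 'X - abc_degree + 1 + qq + abc_degree + 1'
key = 'abc_degree + 1'

# Unescaped, ' +' is a quantifier ("one or more spaces"), so the pattern
# does not describe the literal text 'abc_degree + 1'.
unescaped_hit = re.search(key, example)

# Escaped, every character of the key is matched literally.
escaped_spans = [m.span() for m in re.finditer(re.escape(key), example)]

print(unescaped_hit)   # None
print(escaped_spans)   # [(4, 18), (26, 40)]
```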

That is one reason the roll-your-own versions are at a disadvantage, unless you 
roll your own in a similar way by writing a similar C function.

Nobody has shown us what really should exist: a simple but fast text search 
function that does a similar job. It may still be out there, but as you point 
out, perhaps it is not needed as long as people just use the re version.

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail....@python.org> On 
Behalf Of Roel Schroeven
Sent: Tuesday, February 28, 2023 4:33 AM
To: python-list@python.org
Subject: Re: How to escape strings for re.finditer?

Op 28/02/2023 om 3:44 schreef Thomas Passin:
> On 2/27/2023 9:16 PM, avi.e.gr...@gmail.com wrote:
>> And, just for fun, since there is nothing wrong with your code, this 
>> minor change is terser:
>>
>> >>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>> >>> for match in re.finditer(re.escape('abc_degree + 1'), example):
>> ...     print(match.start(), match.end())
>> ...
>> 4 18
>> 26 40
>
> Just for more fun :) -
>
> Without knowing how general your expressions will be, I think the 
> following version is very readable, certainly more readable than regexes:
>
> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
> KEY = 'abc_degree + 1'
>
> for i in range(len(example)):
>     if example[i:].startswith(KEY):
>         print(i, i + len(KEY))
> # prints:
> 4 18
> 26 40

I think it's often a good idea to use a standard library function instead of 
rolling your own. The issue becomes less clear-cut when the standard library 
doesn't do exactly what you need (as here, where re.finditer() uses regular 
expressions while the use case only uses simple search strings). Ideally there 
would be a str.finditer() method we could use, but in the absence of that I 
think we still need to consider using the almost-but-not-quite fitting 
re.finditer().
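For comparison, here is a rough sketch of what such a str.finditer() might look 
like if built on the existing str.find() method. The name find_all is 
hypothetical (it is not part of the standard library), and I chose to skip past 
each match so that overlapping occurrences are not reported, which mirrors 
what re.finditer() does.

```python
def find_all(key, text):
    """Yield (start, end) for each non-overlapping occurrence of key in text.

    Hypothetical helper: str has no finditer() method, so this is built on
    str.find(), which accepts a start offset as its second argument.
    """
    if not key:
        raise ValueError('key must not be empty')
    start = text.find(key)
    while start != -1:
        end = start + len(key)
        yield start, end
        # Resume the search after the match, as re.finditer() does.
        start = text.find(key, end)


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
spans = list(find_all('abc_degree + 1', example))
print(spans)   # [(4, 18), (26, 40)]
```

The scanning itself happens inside str.find(), so the Python-level loop runs 
only once per match rather than once per character.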

Two reasons:

(1) I think it's clearer: the name tells us what it does (though of course we 
could solve this in a hand-written version by wrapping it in a suitably named 
function).

(2) Searching for a string in another string, in a performant way, is not as 
simple as it first appears. Your version works correctly, but slowly. In some 
situations it doesn't matter, but in other cases it will. For better 
performance, string searching algorithms jump ahead either when they found a 
match or when they know for sure there isn't a match for some time (see e.g. 
the Boyer–Moore string-search algorithm). 
You could write such a more efficient algorithm, but then it becomes more 
complex and more error-prone. Using a well-tested existing function becomes 
quite attractive.
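
To give an idea of what "jumping ahead" means, here is a minimal sketch of the 
Horspool simplification of Boyer–Moore (not the full algorithm, and not what 
CPython uses internally as far as I know). The function name is my own. Note 
that, like the simple loop, it reports overlapping occurrences.

```python
def horspool_find_all(key, text):
    """Return (start, end) for every occurrence of key in text, overlaps included."""
    m, n = len(key), len(text)
    if m == 0:
        raise ValueError('key must not be empty')
    # For each character of the key except the last: how far the window may
    # shift when that character sits under the window's last position.
    # Later occurrences overwrite earlier ones, giving the smaller, safe shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(key[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == key:
            matches.append((pos, pos + m))
        # A character that does not occur in the key allows a jump of the
        # full key length; otherwise use the precomputed shift.
        pos += shift.get(text[pos + m - 1], m)
    return matches


example = 'X - abc_degree + 1 + qq + abc_degree + 1'
print(horspool_find_all('abc_degree + 1', example))   # [(4, 18), (26, 40)]
```

Written in pure Python this will most likely still lose to the C-implemented 
functions; it only shows where the extra complexity comes from.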

To illustrate the difference in performance, I did a simple test (using the 
paragraph above as test text):

     import re
     import timeit

     def using_re_finditer(key, text):
         matches = []
         for match in re.finditer(re.escape(key), text):
             matches.append((match.start(), match.end()))
         return matches

     def using_simple_loop(key, text):
         matches = []
         for i in range(len(text)):
             if text[i:].startswith(key):
                 matches.append((i, i + len(key)))
         return matches

     CORPUS = """Searching for a string in another string, in a performant way, is
     not as simple as it first appears. Your version works correctly, but slowly.
     In some situations it doesn't matter, but in other cases it will. For better
     performance, string searching algorithms jump ahead either when they found a
     match or when they know for sure there isn't a match for some time (see e.g.
     the Boyer–Moore string-search algorithm). You could write such a more
     efficient algorithm, but then it becomes more complex and more error-prone.
     Using a well-tested existing function becomes quite attractive."""
     KEY = 'in'
     print('using_simple_loop:',
           timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(),
                         number=1000))
     print('using_re_finditer:',
           timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(),
                         number=1000))

This does 5 runs of 1000 repetitions each, and reports the time in seconds for 
each of those runs.
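
For anyone who wants to see where the 5 comes from, here is a small sketch 
with the repeat count spelled out. The work() function is just a stand-in I 
made up for this illustration; repeat=5 is the default in current Python 
versions, and the timeit documentation suggests looking at the minimum of the 
runs rather than the average.

```python
import timeit


def work():
    # Stand-in workload for illustration only.
    return 'in' in 'Searching for a string in another string'


# repeat=5 matches the default; number=1000 is the count used in the test above.
timings = timeit.repeat(stmt=work, repeat=5, number=1000)

print(len(timings))   # 5
print(min(timings))   # fastest of the five runs, in seconds
```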
Result on my machine:

     using_simple_loop: [0.13952950000020792, 0.13063130000000456, 
0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
     using_re_finditer: [0.003861400000005233, 0.004061900000124297, 
0.003478999999970256, 0.003413100000216218, 0.0037320000001273]

We find that in this test re.finditer() is more than 30 times faster (despite 
the overhead of regular expressions).

While speed isn't everything in programming, with such a large difference in 
performance and (to me) no real disadvantages of using re.finditer(), I would 
prefer re.finditer() over writing my own.

--
"The saddest aspect of life right now is that science gathers knowledge faster 
than society gathers wisdom."
         -- Isaac Asimov

--
https://mail.python.org/mailman/listinfo/python-list
