[issue42885] Optimize re.search() for \A (and maybe ^)

2022-03-22 Thread Serhiy Storchaka


Change by Serhiy Storchaka:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42885] Optimize re.search() for \A (and maybe ^)

2022-03-22 Thread Serhiy Storchaka


Serhiy Storchaka added the comment:


New changeset 492d4109f4d953c478cb48f17aa32adbb912623b by Serhiy Storchaka in 
branch 'main':
bpo-42885: Optimize search for regular expressions starting with "\A" or "^" 
(GH-32021)
https://github.com/python/cpython/commit/492d4109f4d953c478cb48f17aa32adbb912623b


--




[issue42885] Optimize re.search() for \A (and maybe ^)

2022-03-21 Thread Serhiy Storchaka


Change by Serhiy Storchaka:


--
keywords: +patch
pull_requests: +30109
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/32021




[issue42885] Optimize re.search() for \A (and maybe ^)

2021-01-16 Thread Arnim Rupp


Change by Arnim Rupp:


--
components: +Regular Expressions -Library (Lib)
nosy: +ezio.melotti, mrabarnett




[issue42885] Optimize re.search() for \A (and maybe ^)

2021-01-16 Thread Arnim Rupp


Arnim Rupp added the comment:

some more observations, which might be helpful in tracking it down:

x$ is much faster than ^x
A$ is as slow as ^x

$ python3 -m timeit -s "a = 'A'*1" "import re" "re.search('x$', a)"
10 loops, best of 5: 32.9 msec per loop

$ python3 -m timeit -s "a = 'A'*1" "import re" "re.search('A$', a)"
1 loop, best of 5: 802 msec per loop
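
The same comparison can be scripted with the timeit module instead of the command line. This is only a rough sketch: the helper name time_search and the subject length are assumptions chosen so the scan is measurable, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

# Assumed subject length, chosen only to make the scan measurable.
SUBJECT = "A" * 100_000


def time_search(pattern, subject=SUBJECT, number=20):
    """Return the best per-call time in seconds for search() on subject."""
    compiled = re.compile(pattern)
    runs = timeit.repeat(lambda: compiled.search(subject),
                         number=number, repeat=3)
    return min(runs) / number


if __name__ == "__main__":
    # The two patterns timed above, plus the anchored forms for comparison.
    for pattern in ("x$", "A$", "^x", r"\Ax"):
        msec = time_search(pattern) * 1e3
        print(f"{pattern!r}: {msec:.3f} msec per call")
```

Compiling the pattern once keeps the re module's cache lookup out of the measured loop, so the timing mostly reflects the search itself.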

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42885] Optimize re.search() for \A (and maybe ^)

2021-01-16 Thread Steven D'Aprano


Steven D'Aprano added the comment:

On Sat, Jan 16, 2021 at 08:59:13AM, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
> ^ matches not just the beginning of the string. It matches the 
> beginning of a line, i.e. an anchor just after '\n'.

Only in MULTILINE mode.

I completely agree that in multiline mode '^a' has to search the entire 
string. But in single line mode, if the first character isn't an 'a', 
there is no need to continue searching.

As far as I can see from the docs, ^ in single line mode and \A should 
always be constant time.
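
A small sketch of the difference between the two modes, for illustration. The helper name anchor_positions and the sample string are made up for this example; it only lists where each anchor can match and says nothing about how fast search() gets there.

```python
import re


def anchor_positions(pattern, text, flags=0):
    """Return the start offsets of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]


text = "b\na"

# Default (single line) mode: ^ matches only at offset 0.
print(anchor_positions(r"^", text))

# MULTILINE mode: ^ also matches just after each newline.
print(anchor_positions(r"^", text, re.MULTILINE))

# \A matches only at offset 0, with or without MULTILINE.
print(anchor_positions(r"\A", text, re.MULTILINE))

# So '^a' finds nothing here in the default mode, but does in MULTILINE.
print(re.search(r"^a", text))
print(re.search(r"^a", text, re.MULTILINE))
```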

> So the original report is rejected, the behavior is expected and 
> cannot be changed. It is not a bug.

I disagree that it is expected behaviour. "Match the start of the 
string" cannot possibly match anything except the start of the string, 
so who would expect it to keep scanning past the start of the string?

(Multiline mode is, of course, different.)

--




[issue42885] Optimize re.search() for \A (and maybe ^)

2021-01-16 Thread Serhiy Storchaka


Serhiy Storchaka added the comment:

^ matches not just the beginning of the string. It matches the beginning of a 
line, i.e. an anchor just after '\n'. If the input string contains '\n', the 
result cannot be found in less than linear time. If you want to check whether 
the beginning of the string matches a regular expression, it is better to use 
match(). If you want to check whether the whole string matches it, it is better 
to use fullmatch().
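
For illustration, a minimal sketch of how the three methods differ on the same pattern. The helper name where_it_matches and the sample strings are made up for this example.

```python
import re


def where_it_matches(pattern, text):
    """Report which of search(), match() and fullmatch() succeed."""
    compiled = re.compile(pattern)
    return {
        # search() looks for a match starting at any position.
        "search": compiled.search(text) is not None,
        # match() only tries at the beginning of the string.
        "match": compiled.match(text) is not None,
        # fullmatch() requires the whole string to match.
        "fullmatch": compiled.fullmatch(text) is not None,
    }


for sample in ("ab", "abxx", "xxab"):
    print(sample, where_it_matches(r"ab", sample))
```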

But in some cases you cannot choose what method to use. If you have a set of 
patterns, and only some of them should be anchored to the start of the string, 
you have to use search(). And while linear complexity for ^ is expected, 
search() is not optimized for \A.
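
A sketch of that situation, assuming a hypothetical rule table in which only the first pattern is anchored. PATTERNS and first_hit are illustrative names, not part of any real code base. Because every rule goes through the same call, the anchored rule also has to use search().

```python
import re

# Hypothetical rule set: only the first pattern is anchored to the start.
PATTERNS = [
    re.compile(r"\AERROR"),   # must appear at the start of the string
    re.compile(r"timeout"),   # may appear anywhere
]


def first_hit(line):
    """Return the source of the first pattern that matches, or None."""
    for compiled in PATTERNS:
        # search() is the only method that is correct for both kinds of rule.
        if compiled.search(line):
            return compiled.pattern
    return None


print(first_hit("ERROR: disk full"))
print(first_hit("request timeout"))
print(first_hit("no ERROR here"))
```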

So the original report is rejected, the behavior is expected and cannot be 
changed. It is not a bug. But some optimization can be added for \A, and 
perhaps the constant multiplier for ^ can be reduced too.

--
title: Regex performance problem with ^ aka AT_BEGINNING -> Optimize 
re.search() for \A (and maybe ^)
versions:  -Python 3.9
