New submission from Matt Miller <[email protected]>:
I was evaluating a few regular expressions for parsing URLs. One such
expression
(https://daringfireball.net/2010/07/improved_regex_for_matching_urls) causes
the `re.Pattern` to exhibit some strange behavior (notice the stripped
characters in the `repr`):
```
>>> import re
>>> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
>>> url_pat = re.compile(STR_RE_URL)
>>> print(url_pat)
re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\,
re.IGNORECASE)
```
I started looking at this because the following string causes the same
`re.Pattern` object's `.search()` method to run, as far as I can tell,
forever:
```
>>> weird_str = """AY:OhQOhQNhQLdLAX78N'7M&6K%4K#4K#7N&9P(JcHOiQE^=8P'F_DJdLC\@9P&D\;IdKHbJ@Z8AY7@Y7AY7B[9E_<Ha?G`>Jc@Jc:F_1PjRRlSOiLKeAKeAGa=D^:F`=Ga=Fa<MhHRmSRlSJc7Ga1Ic3Kd4Jc3Ga0<V&?Y*D]-Hb1Mg7D^/;S@+@)"""
>>> url_pat.search(weird_str)
```
The `.search(weird_str)` will never exit.
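One plausible explanation, stated as an assumption rather than a confirmed diagnosis, is catastrophic backtracking: the URL expression contains a repeated group whose body is itself repeated (`(?:[^\s()<>]+|...)+`), so a failing match can be retried over a very large number of ways of splitting the same characters. The sketch below uses a much smaller, hypothetical pattern of the same shape (`NESTED` and `time_failed_search` are illustrative names, not part of the report) to show how the time of a failing search can grow with input length.

```python
import re
import time

# Hypothetical toy pattern, not the URL regex from this report: a repeated
# group whose body is itself repeated, followed by a literal that the test
# input never contains.
NESTED = re.compile(r"(?:a+)+b")


def time_failed_search(n):
    """Return the seconds spent on a search over n 'a' characters.

    The input has no 'b', so the search cannot succeed and the engine may
    try many different ways of splitting the run of 'a' characters.
    """
    text = "a" * n
    start = time.perf_counter()
    NESTED.search(text)
    return time.perf_counter() - start


# Input lengths are kept small on purpose so the loop finishes quickly.
for n in (10, 14, 18):
    print(n, time_failed_search(n))
```

If the timings climb steeply as `n` grows, the search is probably not looping forever but taking a number of steps that grows exponentially with the input, which would look the same from the prompt.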
I assume the `.search()` taking forever is an error in the expression, but
the fact that it causes the `repr` to strip some characters was something I
thought should be looked into.
I have not tested this on any other versions of Python.
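A quick way to check whether the characters are missing only from the displayed `repr` or from the compiled object itself is to compare the `repr` with the `pattern` attribute, which holds the source string the object was compiled from. This is a minimal sketch with a hypothetical long pattern (`long_source` is made up for illustration). Whether and where the `repr` is shortened is an assumption about the interpreter in use, so the code prints lengths rather than asserting a specific cut-off.

```python
import re

# Hypothetical long pattern, used only for illustration.
long_source = "|".join(f"word{i}" for i in range(100))
pat = re.compile(long_source, re.IGNORECASE)

shown = repr(pat)     # what print(pat) displays; may be shortened for display
stored = pat.pattern  # the source string the pattern object was compiled from

# Compare the displayed form with the stored source.
print(len(shown), len(stored))
print(stored == long_source)
```

If `stored == long_source` holds while `shown` is shorter, the stripped characters affect only how the object is printed, not what it matches.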
----------
components: Library (Lib)
messages: 370784
nosy: Matt Miller
priority: normal
severity: normal
status: open
title: Strange regex cycle
versions: Python 3.7
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue40879>
_______________________________________