[issue40879] Strange regex cycle

Matt Miller Fri, 05 Jun 2020 13:50:13 -0700

New submission from Matt Miller <usr.bin.bour...@gmail.com>:

I was evaluating a few regular expressions for parsing URLs.  One such 
expression 
(https://daringfireball.net/2010/07/improved_regex_for_matching_urls) causes 
the `re.Pattern` to exhibit some strange behavior (notice the stripped 
characters in the `repr`):


```
>>> import re
>>> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
>>> print(re.compile(STR_RE_URL))
re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\, re.IGNORECASE)
```
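
For context on the shortened `repr`: as far as I know, CPython caps the pattern portion of a compiled pattern's `repr` at 200 characters, so the output is abbreviated for display only. A minimal sketch, using a hypothetical 500-character pattern (not the URL regex from this report), suggesting that the compiled object still holds the full pattern:

```python
import re

# Hypothetical long pattern, used only to push the repr past the display limit.
long_pattern = "a" * 500
compiled = re.compile(long_pattern)

# The repr is shorter than the pattern it was built from (observed on CPython).
print(len(long_pattern), len(repr(compiled)))

# The full, untruncated pattern is still available on the compiled object.
print(compiled.pattern == long_pattern)
```

If this holds, `url_pat.pattern` is the reliable way to inspect the expression, and the clipped `repr` is unrelated to the hang.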

The reason I started looking at this was because the following string causes 
the same `re.Pattern` object's `.search()` method to loop forever for some 
reason:

```
>>> weird_str = """AY:OhQOhQNhQLdLAX78N'7M&6K%4K#4K#7N&9P(JcHOiQE^=8P'F_DJdLC\@9P&D\;IdKHbJ@Z8AY7@Y7AY7B[9E_<Ha?G`>Jc@Jc:F_1PjRRlSOiLKeAKeAGa=D^:F`=Ga=Fa<MhHRmSRlSJc7Ga1Ic3Kd4Jc3Ga0<V&?Y*D]-Hb1Mg7D^/;S@+@)"""
>>> url_pat = re.compile(STR_RE_URL)
>>> url_pat.search(weird_str)
```

The `.search(weird_str)` will never exit.


I assume the `.search()` taking forever is an error in the expression, but 
the fact that it causes the `repr` to strip some characters was something I 
thought should be looked into.
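
The hang is likely a case of catastrophic backtracking: the URL expression repeats a group that itself contains `[^\s()<>]+`, which is a quantifier nested inside a quantified group. The sketch below uses a much smaller hypothetical pattern, `(?:a+)+b`, with the same shape, and times a search that can never succeed. The pattern, the helper name `time_failed_search`, and the input sizes are assumptions chosen for illustration, not a diagnosis of the exact expression above.

```python
import re
import time

# Hypothetical toy pattern: a quantified group wrapping another quantifier.
NESTED = re.compile(r"(?:a+)+b")


def time_failed_search(n):
    """Search n 'a' characters; no 'b' is present, so no match is possible."""
    text = "a" * n
    start = time.perf_counter()
    result = NESTED.search(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Input sizes are kept small so the script finishes quickly.
for n in (8, 12, 16):
    result, elapsed = time_failed_search(n)
    print(n, result, f"{elapsed:.4f}s")
```

On an engine that backtracks naively, the elapsed time tends to grow sharply with each extra character, which is how a string of about 200 characters can appear to run forever.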

I have not tested this on any other versions of Python.

----------
components: Library (Lib)
messages: 370784
nosy: Matt Miller
priority: normal
severity: normal
status: open
title: Strange regex cycle
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40879>
_______________________________________