New submission from Matt Miller <usr.bin.bour...@gmail.com>:

I was evaluating a few regular expressions for parsing URLs. One such expression (https://daringfireball.net/2010/07/improved_regex_for_matching_urls) causes the `re.Pattern` to exhibit some strange behavior (notice the stripped characters in the `repr`):

```
>>> import re
>>> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
>>> print(re.compile(STR_RE_URL))
re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\, re.IGNORECASE)
```

I started looking at this because the following string causes the same `re.Pattern` object's `.search()` method to run indefinitely:

```
>>> url_pat = re.compile(STR_RE_URL)
>>> weird_str = """AY:OhQOhQNhQLdLAX78N'7M&6K%4K#4K#7N&9P(JcHOiQE^=8P'F_DJdLC\@9P&D\;IdKHbJ@Z8AY7@Y7AY7B[9E_<Ha?G`>Jc@Jc:F_1PjRRlSOiLKeAKeAGa=D^:F`=Ga=Fa<MhHRmSRlSJc7Ga1Ic3Kd4Jc3Ga0<V&?Y*D]-Hb1Mg7D^/;S@+@)"""
>>> url_pat.search(weird_str)
```

The `.search(weird_str)` call never returns. I assume the endless search is caused by an error in the expression itself, but the characters stripped from the `repr` seemed worth looking into.

I have not tested this on any other versions of Python.

----------
components: Library (Lib)
messages: 370784
nosy: Matt Miller
priority: normal
severity: normal
status: open
title: Strange regex cycle
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40879>
_______________________________________
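
One way to check whether characters are really lost from the compiled pattern, or only from its displayed `repr`, is to compare the `repr` against the `.pattern` attribute. The following is a minimal sketch, assuming CPython abbreviates long patterns when building the `repr` of a compiled pattern; `pattern_text` is a hypothetical stand-in, not the URL expression from the report.

```python
import re

# Hypothetical stand-in pattern, long enough for its repr to be abbreviated.
pattern_text = "(?i)" + "|".join(f"word{i}" for i in range(60))
compiled = re.compile(pattern_text)

# The repr of the compiled object is shorter than the repr of the source
# string, i.e. part of the pattern is missing from the displayed text.
repr_is_abbreviated = len(repr(compiled)) < len(repr(pattern_text))

# The .pattern attribute still holds the complete source string.
pattern_is_intact = compiled.pattern == pattern_text

print("repr abbreviated:", repr_is_abbreviated)
print("pattern intact:  ", pattern_is_intact)
```

If both checks hold for the URL expression as well, the stripped characters would be a display matter only, separate from the `.search()` call that does not return.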