New submission from Niko Matsakis <n...@alum.mit.edu>:

Executing code like this:

>>> import re
>>> r = re.compile(r'(\w+)*=.*')
>>> r.match("abcdefghijklmnopqrstuvwxyz")

takes a long time (around 12 seconds on my machine).  Presumably this is 
because the engine is enumerating every way to divvy up the alphabet among the 
(\w+) repetitions, even though there is no "=" sign to be found, so none of 
those attempts can succeed.  In contrast, Perl seems to run a regular 
expression like that instantly.
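
Here is a minimal timing sketch (my addition, not part of the original report) 
to illustrate the behaviour.  If the engine really is backtracking over every 
way to split the prefix among the (\w+) repetitions, the failure time should 
roughly double with each extra character:

    import re
    import time

    pattern = re.compile(r'(\w+)*=.*')
    alphabet = "abcdefghijklmnopqrstuvwxyz"

    # The match always fails (there is no "=" in the text), but the engine
    # still tries every way to split the prefix among the (\w+) repetitions.
    for n in range(18, 26):
        start = time.time()
        pattern.match(alphabet[:n])
        print("%2d chars: %.3fs" % (n, time.time() - start))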

This could be optimized by recognizing that no "=" sign occurs anywhere in the 
string, and thus it does not matter how the first part of the regular 
expression matches: there is no need to try additional possibilities.  To some 
extent, of course, the answer is just "don't write regular expressions like 
that"; this example is reduced down from a real regexp where the potential 
inefficiency was less obvious.  Nonetheless, an optimization that recognizes 
when further re-enumeration cannot change the outcome seems worthwhile in 
general.  (A workaround for this particular pattern is sketched below.)
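
For completeness, a workaround sketch (my addition, not from the report): in 
this pattern, (\w+)* and \w* match exactly the same strings, but \w* leaves 
the engine only one way to consume the prefix, so the failed match is reported 
immediately:

>>> import re
>>> fast = re.compile(r'\w*=.*')
>>> fast.match("abcdefghijklmnopqrstuvwxyz") is None   # fails instantly
True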

In any case, I am submitting this bug report merely to raise the issue as a 
possible future optimization, not to suggest that it must be addressed 
immediately (or even at all).

----------
components: Regular Expressions
messages: 129262
nosy: nikomatsakis
priority: normal
severity: normal
status: open
title: re engine exhaustively explores more than necessary
type: performance
versions: Python 2.6, Python 3.1

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11307>
_______________________________________