On Thu, 15 Sep 2011 14:54:57 -0400, Terry Reedy wrote: >> I was thinking there might be a technique I could use to evaluate >> regular expressions in a thread or another process launched via >> multiprocessing module and then kill the thread/process after a >> specified timeout period. > > Only solution I remember ever seen posted. I wonder if there are any > heuristics for detecting exponential time re's.
Exponential growth results from non-determinism, i.e. if there are multiple transitions for a given character from a given state. Common patterns include: ...(a...)?a... ...(a...)*a... ...(a...)+a... with a choice between matching the "a" at the start of the bracketed pattern or the "a" following it. (xxxa...|xxxb...|xxxc...) with a choice between branches which cannot be resolved until more data has been read. For re.search (as opposed to re.match): axxxa... When axxx has been read, a following "a" gives a choice between continuing the existing match with the second "a", or aborting and matching against the first "a". For patterns which contain many copies of the initial character, each copy creates another branch. Also, using back-references in a regexp [sic] can be a significant performance killer, as it rules out the use of a DFA (which, IIRC, Python doesn't do anyhow) and can require brute-forcing many combinations. A particularly bad case is: (a*)\1 matching against "aaaaaaaa...". If the performance issue is with re.match/re.search rather than with re.compile, one option is to use ctypes to access libc's regexp functions. Those are likely to provide better searching throughput at the expense of potentially increased compilation time. -- http://mail.python.org/mailman/listinfo/python-list