New submission from Serhiy Storchaka:

Matching and searching case-insensitive regular expressions is much slower than 
matching and searching case-sensitive regular expressions. Case-insensitivity 
requires converting every character in input string to lower case and disables 
some optimizations. But there are only 2669 cased characters (52 in ASCII 
mode). For all other characters in the pattern we can use case-sensitive 
matching.

Results of microbenchmarks:

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai) +x'); s = ' 
'*10000+'x'"  "p.match(s)"
Unpatched:  5000 loops, best of 5: 47.1 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i) +x'); s = ' 
'*10000+'x'"  "p.match(s)"
Unpatched:  2000 loops, best of 5: 109 usec per loop
Patched:    20000 loops, best of 5: 17.6 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[ \t]+x'); s = ' 
'*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 537 usec per loop
Patched:    5000 loops, best of 5: 84.2 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[ \t]+x'); s = ' 
'*10000+'x'"  "p.match(s)"
Unpatched:  500 loops, best of 5: 608 usec per loop
Patched:    5000 loops, best of 5: 84.1 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)/x'); s = ' 
'*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 226 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)/x'); s = ' 
'*10000+'/x'"  "p.search(s)"
Unpatched:  1000 loops, best of 5: 284 usec per loop
Patched:    20000 loops, best of 5: 13.4 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?ai)[/%]x'); s = ' 
'*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 483 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'(?i)[/%]x'); s = ' 
'*10000+'/x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 549 usec per loop
Patched:    1000 loops, best of 5: 279 usec per loop

The patch also optimizes searching some complex case-sensitive patterns. This 
is a side effect, I'll provide more comprehensive optimization in other issue.

$ ./python -m timeit -s "import re; p = re.compile(r'([xy])'); s = ' 
'*10000+'x'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 633 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop

$ ./python -m timeit -s "import re; p = re.compile(r'((x|y)z)'); s = ' 
'*10000+'xz'"  "p.search(s)"
Unpatched:  500 loops, best of 5: 716 usec per loop
Patched:    1000 loops, best of 5: 250 usec per loop

----------
assignee: serhiy.storchaka
components: Library (Lib), Regular Expressions
messages: 293127
nosy: ezio.melotti, mrabarnett, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Optimize case-insensitive regular expressions
type: performance
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30285>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to