New submission from Serhiy Storchaka:

Before PEP 393 the regex functions scanned an array of char or Py_UNICODE, and
character testing was cheap. Since PEP 393 they check the kind of the Unicode
string for every tested character, so processing Unicode strings has become
slower. _sre.c already generates two sets of functions from one source: one for
byte strings and one for Unicode strings. The proposed patch uses the same
technique to generate three sets of functions: for byte/UCS1, UCS2 and UCS4
strings. This simplifies the code (it is now more similar to the pre-PEP 393
version) and makes character testing faster.
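
The idea of choosing a specialized function set once per string, rather than
checking the kind for every character, can be sketched in Python. This is only
an illustration under stated assumptions, not the C code from the patch: the
names pep393_kind, make_search, SEARCH_BY_KIND and search are hypothetical, and
a closure factory stands in for the C template that is expanded once per
character width.

```python
def pep393_kind(s):
    """Return the per-character width in bytes (1, 2 or 4) PEP 393 uses for s."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1  # UCS1: every code point fits in one byte
    if top < 0x10000:
        return 2  # UCS2: every code point fits in two bytes
    return 4      # UCS4: at least one code point needs the full range


def make_search(kind):
    """Hypothetical stand-in for one expansion of the template.

    The character width is fixed when the function is built, so the
    function body never has to ask which kind of string it was given.
    """
    limit = 1 << (8 * kind)

    def search(pattern, text):
        # A pattern character too wide for this kind cannot occur in text.
        if any(ord(c) >= limit for c in pattern):
            return -1
        return text.find(pattern)

    search.__name__ = "search_ucs%d" % kind
    return search


# Three functions generated from one source, one per character width.
SEARCH_BY_KIND = {kind: make_search(kind) for kind in (1, 2, 4)}


def search(pattern, text):
    # The kind is examined once per call, not once per tested character.
    return SEARCH_BY_KIND[pep393_kind(text)](pattern, text)


if __name__ == "__main__":
    for subject in ("x" * 5, "\u20ac" * 5, "\U0001F600" * 5):
        kind = pep393_kind(subject)
        print(kind, SEARCH_BY_KIND[kind].__name__, search("abc", subject))
```

In terms of the benchmark strings, 'x'*100000 is a UCS1 string and
'\u20ac'*100000 is a UCS2 string, so the two str runs exercise two different
function sets.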

Benchmark example:

Python 3.2:
$ python3.2 -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000" "f(x)"
1000 loops, best of 3: 613 usec per loop
$ python3.2 -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000" "f(x)"
1000 loops, best of 3: 232 usec per loop
$ python3.2 -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000" "f(x)"
1000 loops, best of 3: 217 usec per loop

Python 3.4.0a1+ unpatched:
$ ./python -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000" "f(x)"
1000 loops, best of 3: 485 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000" "f(x)"
1000 loops, best of 3: 790 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000" "f(x)"
1000 loops, best of 3: 1.09 msec per loop

Python 3.4.0a1+ patched:
$ ./python -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000" "f(x)"
1000 loops, best of 3: 250 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000" "f(x)"
1000 loops, best of 3: 250 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000" "f(x)"
1000 loops, best of 3: 256 usec per loop
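
To make the comparison easier to read, the timings above can be turned into
ratios. The small script below is only a convenience sketch: the numbers are
copied from the runs above (1.09 msec written as 1090 usec), while the row
labels and the speedup helper are illustrative names, not part of the patch.

```python
# Best-of-3 timings in microseconds, copied from the benchmark runs above.
TIMINGS_USEC = {
    "bytes":      {"3.2": 613, "unpatched": 485, "patched": 250},
    "str (UCS1)": {"3.2": 232, "unpatched": 790, "patched": 250},
    "str (UCS2)": {"3.2": 217, "unpatched": 1090, "patched": 256},
}


def speedup(case, baseline, candidate="patched"):
    """Return how many times faster `candidate` is than `baseline`."""
    row = TIMINGS_USEC[case]
    return row[baseline] / row[candidate]


if __name__ == "__main__":
    for case in TIMINGS_USEC:
        print("%-10s  vs unpatched: %.2fx  vs 3.2: %.2fx"
              % (case, speedup(case, "unpatched"), speedup(case, "3.2")))
```

On these numbers the patched build is about 1.9x, 3.2x and 4.3x faster than
the unpatched one for the bytes, UCS1 and UCS2 cases. Compared with 3.2 it is
about 2.5x faster for bytes and within roughly 7% and 15% of the 3.2 timings
for the two str cases.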

For simplicity, I also propose extracting the template part of _sre.c into a
separate file (i.e. srelib.h) and getting rid of the recursion.

----------
assignee: serhiy.storchaka
components: Regular Expressions, Unicode
files: sre_optimize.patch
keywords: patch
messages: 194669
nosy: ezio.melotti, mrabarnett, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Restore re performance to pre-PEP393 level
type: performance
versions: Python 3.4
Added file: http://bugs.python.org/file31198/sre_optimize.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18685>
_______________________________________