[issue24426] re.split performance degraded significantly by capturing group

Patrick Maupin Sat, 13 Jun 2015 07:06:06 -0700

Patrick Maupin added the comment:

> (stuff about cPython)


No, I was curious about whether somebody maintains pure-Python fixes (e.g. to 
the re parser and compiler).  Those could live in a regular package that fixes 
some corner cases, such as the capture-group case you just applied a patch for.
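
For readers who have not followed the issue, here is a minimal sketch of the 
capture-group case in re.split.  The input string and the repeat count are 
made up for illustration, and the timings it prints will vary by Python 
version and machine, so no particular ratio is claimed here.

```python
import re
import timeit

# Hypothetical input: 10000 short fields separated by commas.
text = ",".join(["field"] * 10000)

plain = re.compile(r",")       # separator without a capturing group
captured = re.compile(r"(,)")  # same separator, wrapped in a capturing group

# With a capturing group, re.split also returns the separators themselves.
parts_plain = plain.split(text)
parts_captured = captured.split(text)

# Rough timing comparison of the two forms; results depend on the interpreter.
t_plain = timeit.timeit(lambda: plain.split(text), number=20)
t_captured = timeit.timeit(lambda: captured.split(text), number=20)
print("no group:   %.4f s" % t_plain)
print("with group: %.4f s" % t_captured)
```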

> ... regex is more powerful and better supports Unicode.

Unfortunately, it is still not competitive on speed.  For example, for one 
package I maintain (github.com/pmaupin/pdfrw), I have one unit test which reads 
in and parses several PDF files, and then outputs them to new PDF files:

Python 2.7 with re -- 5.9 s
Python 2.7 with regex -- 6.9 s
Python 3.4 with re -- 7.2 s
Python 3.4 with regex -- 8.2 s

A large reason for the difference between 2.7 and 3.4 is the fact that I'm too 
lazy, or it seems too error-prone, to put the b'xxx' in front of every string, 
so the package uses the same source code for 2.7 and 3.4, which means unicode 
strings for 3.4 and byte strings for 2.7.
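
A small sketch of what that choice means on Python 3.  The pattern and sample 
data are hypothetical and are not taken from pdfrw; the point is only that a 
pattern written without the b prefix is a str pattern on Python 3, and that 
str and bytes patterns cannot be mixed with the other kind of input.

```python
import re

# Hypothetical token pattern, written once without a b'' prefix.
TOKEN = r"\d+"

text_pattern = re.compile(TOKEN)                   # str pattern on Python 3
bytes_pattern = re.compile(TOKEN.encode("ascii"))  # explicit bytes pattern

text_hits = text_pattern.findall("1 0 obj 23")
bytes_hits = bytes_pattern.findall(b"1 0 obj 23")

# Python 3 refuses to apply a str pattern to bytes input.
try:
    text_pattern.findall(b"1 0 obj 23")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```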

Nonetheless, when you consider all the other work going on in the package, a 
14% _overall_ slowdown to change to a "better" re package seems like going in 
the wrong direction, especially when stacked on top of the 22% slowdown for 
switching to Python 3.
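
The percentages can be reproduced from the four timings quoted above.  The 
helper name below is just for illustration; the 14% figure corresponds to the 
Python 3.4 pair of timings and the 22% figure to re on 2.7 versus re on 3.4.

```python
# Timings in seconds, copied from the figures quoted in this message.
timings = {
    ("2.7", "re"): 5.9,
    ("2.7", "regex"): 6.9,
    ("3.4", "re"): 7.2,
    ("3.4", "regex"): 8.2,
}

def slowdown_percent(baseline, other):
    """Relative slowdown of `other` versus `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0

# Switching from re to regex on Python 3.4.
regex_on_34 = slowdown_percent(timings[("3.4", "re")], timings[("3.4", "regex")])

# Switching from Python 2.7 to Python 3.4 while staying on re.
py3_with_re = slowdown_percent(timings[("2.7", "re")], timings[("3.4", "re")])

print("regex vs re on 3.4: %.0f%%" % regex_on_34)
print("3.4 vs 2.7 with re: %.0f%%" % py3_with_re)
```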

> Do you mean documenting codes of compiled re pattern?

Yes.


> This is implementation detail and will be changed in future.

Understood, and that's fine.  If the documentation existed, it would help 
if I wanted to create a pure-Python package that simply performed 
optimizations (like the one in your patch) against existing Python 
implementations, for use until regex (which is a HUGE undertaking) is ready.
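
In the absence of such documentation, one way to look at what the parser 
produces is the re.DEBUG flag, sketched below.  The helper name is 
hypothetical, and the dump format is exactly the kind of implementation 
detail discussed above, so it differs between Python versions and should not 
be relied on.

```python
import contextlib
import io
import re

def dump_pattern(pattern):
    """Return the debug dump that re prints when compiling `pattern`."""
    buffer = io.StringIO()
    # re.DEBUG makes the compiler print its view of the pattern to stdout.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

# The capturing-group separator from this issue, as the parser sees it.
dump = dump_pattern(r"(,)")
print(dump)
```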

Thanks,
Pat

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue24426>
_______________________________________