[issue24426] re.split performance degraded significantly by capturing group

2015-06-21 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: -- resolution: -> fixed, stage: patch review -> resolved, status: open -> closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue24426 ___

[issue24426] re.split performance degraded significantly by capturing group

2015-06-21 Thread Roundup Robot
Roundup Robot added the comment: New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default': Issue #24426: Fast searching optimization in regular expressions now works https://hg.python.org/cpython/rev/7e46a503dd16 -- nosy: +python-dev

[issue24426] re.split performance degraded significantly by capturing group

2015-06-13 Thread Patrick Maupin
Patrick Maupin added the comment: (stuff about cPython) No, I was curious about whether somebody maintained pure-Python fixes (e.g. to the re parser and compiler). Those could be in a regular package that fixed some corner cases such as the capture group you just applied a patch for. ...

[issue24426] re.split performance degraded significantly by capturing group

2015-06-13 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: This is a reason to file a feature request to regex. In 3.3 re was slower than regex in some cases: $ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)" Python 3.3 re : 1000 loops, best of 3: 952 usec per

[issue24426] re.split performance degraded significantly by capturing group

2015-06-13 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> 1) Do you know if anybody maintains a patched version of the Python code anywhere? I could put a package up on github/PyPI, if not.
Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on Bitbucket [1] and GitHub [2]. They don't

[issue24426] re.split performance degraded significantly by capturing group

2015-06-11 Thread Patrick Maupin
Patrick Maupin added the comment: Thank you for the quick response, Serhiy. I had started investigating and come to the conclusion that it was a problem with the compiler rather than the C engine. Interestingly, my next step was going to be to use names for the compiler constants, and then

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Splitting with pattern '\n(?<=(\n))' produces the same result as with pattern '(\n)' and is as fast as with pattern '\n'.
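A quick way to check the equivalence claimed here. This is only a sketch: the sample string and variable names are made up for illustration. It uses the lookbehind form '\n(?<=(\n))', which starts with a literal newline and then captures that same newline; a lookahead placed after the newline would instead match only doubled newlines.

```python
import re

# Hypothetical sample text, including an empty line.
text = "a\nb\n\nc"

plain = re.split(r"\n", text)                # separators dropped
captured = re.split(r"(\n)", text)           # separators kept via a capturing group
lookbehind = re.split(r"\n(?<=(\n))", text)  # literal prefix, group captured in a lookbehind

# The lookbehind variant yields the same list as the plain capturing group.
print(captured == lookbehind)

# Dropping every second element (the captured separators) recovers the plain split.
print(captured[::2] == plain)
```

Whether the lookbehind form is actually faster depends on the interpreter version, so it is worth timing on the build in use rather than assuming.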

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Patrick Maupin
New submission from Patrick Maupin: The addition of a capturing group in a re.split() pattern, e.g. using '(\n)' instead of '\n', causes a factor of 10 performance degradation. I use re.split() a lot, but never noticed the issue before. It was extremely noticeable on 1000 patterns in a 5GB

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Regular expressions are optimized for the case when the pattern starts with a constant string or charset. This is not a degradation when using '(\n)', but rather an optimization of '\n'.

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Patrick Maupin
Patrick Maupin added the comment: 1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed. 2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: -- nosy: +serhiy.storchaka

[issue24426] re.split performance degraded significantly by capturing group

2015-06-10 Thread Patrick Maupin
Patrick Maupin added the comment: Just to be perfectly clear, this is no exaggeration: My original file was slightly over 5GB. I have approximately 1050 bad strings in it, averaging around 11 characters per string. If I split it without capturing those 1050 strings, it takes 3.7 seconds. If
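A rough way to reproduce this kind of comparison on a small scale. This is a sketch under stated assumptions, not the original test: the marker string, filler size, and repeat counts are invented, and the data is about 2 MB rather than 5 GB. The size of the gap depends on the Python version; interpreters that include the changeset mentioned in this thread may show little or no difference.

```python
import re
import timeit

# Synthetic stand-in for the large file: long filler runs separated by a rare
# 11-character marker (hypothetical; the real "bad strings" are not shown in the thread).
MARKER = "BADSTRING01"
data = ("x" * 10_000 + MARKER) * 200

plain = re.compile(MARKER)                 # starts with a literal string
captured = re.compile("(" + MARKER + ")")  # same literal wrapped in a capturing group


def bench(pattern):
    """Return the best total time in seconds for 5 splits, over 3 repeats."""
    return min(timeit.repeat(lambda: pattern.split(data), number=5, repeat=3))


t_plain = bench(plain)
t_captured = bench(captured)
print(f"without capture: {t_plain:.4f} s")
print(f"with capture:    {t_captured:.4f} s")
```

Taking the minimum over several repeats follows the usual timeit advice, since the slower runs mostly reflect interference from other processes.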