Changes by Serhiy Storchaka storch...@gmail.com:
--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24426
___
Roundup Robot added the comment:
New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default':
Issue #24426: Fast searching optimization in regular expressions now works
https://hg.python.org/cpython/rev/7e46a503dd16
--
nosy: +python-dev
___
Patrick Maupin added the comment:
(stuff about cPython)
No, I was curious about whether somebody maintained pure-Python fixes (e.g. to
the re parser and compiler). Those could be in a regular package that fixed
some corner cases such as the capture group you just applied a patch for.
...
Serhiy Storchaka added the comment:
This is a reason to file a feature request to regex. In 3.3 re was slower than
regex in some cases:
$ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" "p.split(s)"
Python 3.3 re : 1000 loops, best of 3: 952 usec per loop
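The same measurement can be reproduced from within Python using the `timeit` module instead of the CLI; this is a minimal sketch using the sizes from the quoted command (the pattern here begins with a constant string, so the engine's fast-search path applies):

```python
import re
import timeit

p = re.compile('\n\r')           # pattern starts with a literal: fast path eligible
s = ('a' * 100 + '\n\r') * 1000  # ~100 KB test string, 1000 delimiters

t = timeit.timeit(lambda: p.split(s), number=1000)
print(f"{t / 1000 * 1e6:.0f} usec per loop")
```

Absolute timings will differ by machine and interpreter version; the point is only the relative cost of the split.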
Serhiy Storchaka added the comment:
> 1) Do you know if anybody maintains a patched version of the Python code
> anywhere? I could put a package up on github/PyPI, if not.
Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on
Bitbucket [1] and GitHub [2]. They don't
Patrick Maupin added the comment:
Thank you for the quick response, Serhiy. I had started investigating and come
to the conclusion that it was a problem with the compiler rather than the C
engine. Interestingly, my next step was going to be to use names for the
compiler constants, and then
Serhiy Storchaka added the comment:
Splitting with pattern '\n(?=(\n))' produces the same result as with pattern
'(\n)' and is as fast as with pattern '\n'.
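The lookahead trick can be demonstrated; note the exact pattern quoted above may have been mangled in the digest, so this sketch uses the variant `'(?=(\n))\n'`, which captures the delimiter inside a zero-width lookahead while the consumed part of the pattern remains the plain literal `'\n'`:

```python
import re

s = 'a\nb\nc'
with_group = re.split('(\n)', s)            # capturing group around the delimiter
with_lookahead = re.split('(?=(\n))\n', s)  # delimiter captured via lookahead

print(with_group)      # ['a', '\n', 'b', '\n', 'c']
print(with_lookahead)  # same pieces, same captured delimiters
```

Both calls return the delimiters interleaved with the fields, because `re.split` inserts the text of any capturing group between the split pieces.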
--
New submission from Patrick Maupin:
The addition of a capturing group in a re.split() pattern, e.g. using '(\n)'
instead of '\n', causes a factor of 10 performance degradation.
I use re.split() a lot, but never noticed the issue before. It was extremely
noticeable on 1000 patterns in a 5GB
Serhiy Storchaka added the comment:
A regular expression is optimized for the case when it starts with a constant
string or charset. There is no degradation when using '(\n)'; rather, '\n'
benefits from an optimization.
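The gap described above can be observed directly; this sketch times a split on the bare literal `'\n'` (eligible for the fast-search path) against `'(\n)'`, where the capturing group hid the literal from the optimizer in pre-fix versions:

```python
import re
import timeit

s = ('a' * 100 + '\n') * 1000  # 1000 lines of 100 characters

plain = re.compile('\n')       # starts with a literal: fast path
captured = re.compile('(\n)')  # starts with a capturing group

t_plain = timeit.timeit(lambda: plain.split(s), number=200)
t_captured = timeit.timeit(lambda: captured.split(s), number=200)
print(f"'\\n': {t_plain:.4f}s  '(\\n)': {t_captured:.4f}s")
```

On interpreters that include the fix from this issue, the two timings should be close; on older ones the capturing variant was reported to be roughly an order of magnitude slower.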
--
Patrick Maupin added the comment:
1) I have obviously oversimplified my test case, to the point where a developer
thinks I'm silly enough to reach for the regex module just to split on a
linefeed.
2) '\n(?=(\n))' -- yes, of course, any casual user of the re module would
immediately choose
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +serhiy.storchaka
___
Patrick Maupin added the comment:
Just to be perfectly clear, this is no exaggeration:
My original file was slightly over 5GB.
I have approximately 1050 bad strings in it, averaging around 11 characters per
string.
If I split it without capturing those 1050 strings, it takes 3.7 seconds.
If