[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2020-10-19 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

2.7 is end of life.

if you have regular expression performance issues with something in 3, please 
open a new issue.

--
nosy: +gregory.p.smith
resolution:  -> wont fix
stage:  -> resolved
status: pending -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2019-09-09 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
status: open -> pending

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-02 Thread Steve Newcomb

Steve Newcomb added the comment:

On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you ... make a small set of regular expressions that 
> demonstrate the performance regression.
>
Done.  Attachments:

test.py : Code that exercises re.sub() and outputs a profile report.

test_output_2.7.6.txt : Output of test.py under Python 2.7.6.

test_output_2.7.12.txt : Output of test.py under Python 2.7.12.

p17.188.htm -- test data: public information from the U.S. Internal 
Revenue Service.

Equivalent hardware was used in both cases.

The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
2.7.6's.  It's a significant difference, but...

...it was not the dramatic degradation I expected to find in this 
exercise.  Therefore I attempted to tease what I was looking for out of 
the profile stats I already uploaded to this site, made from actual 
production runs.  My attempts are all found in an hg repository that can 
be downloaded from 
sftp://s...@coolheads.com//files/py-re-perform-276-2712 using password 
bysIe20H .

I do not feel the latter work took me where I wanted to go, and I think 
the reason is that, at least for purposes of our application, Python 
2.7.12 has been so extensively refactored since Python 2.7.6.  So it's 
an apples-to-oranges comparison, apparently.  Still, the performance 
difference for re.sub() is quite dramatic , and re.sub() is the only 
comparable function whose performance dramatically worsened: in our 
application, 2.7.12's re.sub() takes 3.04 times as long as 2.7.6's.

The good news, of course, is that by and large the performance of the 
other *comparable* functions largely improved, often dramatically.  But 
at least in our application, it doesn't come close to making up for the 
degradation in re.sub().

My by-the-gut bottom line: somebody who really knows the re module 
should take a deep look at re.sub().  Why would re.sub(), unlike all 
others, take so much longer to run, while *every* other function in the 
re module get (often much) faster?  It feels like there's a bug 
somewhere in re.sub().

Steve Newcomb

--
Added file: http://bugs.python.org/file44335/test.py
Added file: http://bugs.python.org/file44336/test_output_2.7.6.txt
Added file: http://bugs.python.org/file44337/p17-188.htm
Added file: http://bugs.python.org/file44338/test_output_2.7.12.txt

___
Python tracker 

___#!/usr/bin/env python2

import codecs, profile, os, re, sys

hrefRE = re.compile(
''.join(
[
r'href=',
r'(?P["\'])',
r'(?P',
r'.*?',
r')',
r'(?=quote)',
],
),
)
###
onePathSegmentMS = ''.join(
[
r'(?P<_pathSeg>',
r'(',
r'/?',
r'(',
r'(?!',
r'[ \t\r\n]+',
r'$',
r')',
u'[^%s]' % ( re.escape( r'/?#')),
r')+',
r'|',
r'/',
r')',
r')',
],
)
onePathSegmentRE = re.compile( onePathSegmentMS)

###
uriMS = r''.join(
(
r'(?P',  ## leading whitespace is OK and ignorable; 
see http://dev.w3.org/html5/spec-LC/urls.html
r'[ \t\r\n]+',
r')?',
r'(',
r'(?P',
r'https?',  
r')',
r':\/{0,2}',   ## accounts for encountered error: only 0 or 1 slash 
instead of 2
r')?',
r'(?P',
r'(?P',
r'(',
r'(?P<_userinfo>',
r'[^%s]+' % ( re.escape( r'@/[:?#')),
r')',
re.escape( '@'),
r')?',
r')',
r'(?P',
r'(?P',
re.escape( r'['),
r')?',
r'(',
r'(?P',
r'(',
r'[0-9]{1,3}%s' % ( re.escape( r'.')),
r'){3}',
r'[0-9]{1,3}',
r')',
r'|',
r'(?P',
r'(',
r'[0-9A-Fa-f]{0,4}%s' % ( re.escape( ':')),
r'){1,7}',
r'[0-9A-Fa-f]{0,4}',
r')',
r'|',
r'(?P',
r'(',
r'[^%s]+?' % ( re.escape( r']:/?#')),  ## this may 
have dots
r'\.',
r')+',
r'(?P',  ## top-level domain, e.g. "com", "gov" 
etc.
r'(',

[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-01 Thread Steve Newcomb

Steve Newcomb added the comment:

On 09/01/2016 05:01 PM, Steve Newcomb wrote:
>
>> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
>> 2.7.6's.  It's a significant difference, but...
>>
>> ...it was not the dramatic degradation I expected to find in this 
>> exercise.
On second (third?) thought, the degree of degradation could easily 
depend on the source data being processed.  Maybe test.py does, in fact, 
demonstrate the problem, but the test data I used (p17-118.htm) do not 
demonstrate a terribly severe case of the problem.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-01 Thread STINNER Victor

Changes by STINNER Victor :


--
nosy: +haypo
type:  -> performance

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-01 Thread Steve Newcomb

Steve Newcomb added the comment:

Oops.  The correct url is sftp://coolheads.com/files/py-re-perform-276v2712/

On 09/01/2016 04:52 PM, Steve Newcomb wrote:
> On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
>> Raymond Hettinger added the comment:
>>
>> It would be helpful if you ... make a small set of regular 
>> expressions that demonstrate the performance regression.
>>
> Done.  Attachments:
>
> test.py : Code that exercises re.sub() and outputs a profile report.
>
> test_output_2.7.6.txt : Output of test.py under Python 2.7.6.
>
> test_output_2.7.12.txt : Output of test.py under Python 2.7.12.
>
> p17.188.htm -- test data: public information from the U.S. Internal 
> Revenue Service.
>
> Equivalent hardware was used in both cases.
>
> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
> 2.7.6's.  It's a significant difference, but...
>
> ...it was not the dramatic degradation I expected to find in this 
> exercise.  Therefore I attempted to tease what I was looking for out 
> of the profile stats I already uploaded to this site, made from actual 
> production runs.  My attempts are all found in an hg repository that 
> can be downloaded from 
> sftp://s...@coolheads.com//files/py-re-perform-276-2712 using password 
> bysIe20H .
>
> I do not feel the latter work took me where I wanted to go, and I 
> think the reason is that, at least for purposes of our application, 
> Python 2.7.12 has been so extensively refactored since Python 2.7.6.  
> So it's an apples-to-oranges comparison, apparently.  Still, the 
> performance difference for re.sub() is quite dramatic , and re.sub() 
> is the only comparable function whose performance dramatically 
> worsened: in our application, 2.7.12's re.sub() takes 3.04 times as 
> long as 2.7.6's.
>
> The good news, of course, is that by and large the performance of the 
> other *comparable* functions largely improved, often dramatically.  
> But at least in our application, it doesn't come close to making up 
> for the degradation in re.sub().
>
> My by-the-gut bottom line: somebody who really knows the re module 
> should take a deep look at re.sub().  Why would re.sub(), unlike all 
> others, take so much longer to run, while *every* other function in 
> the re module get (often much) faster?  It feels like there's a bug 
> somewhere in re.sub().
>
> Steve Newcomb
>

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Steve Newcomb

Steve Newcomb added the comment:

On 08/30/2016 01:24 PM, Serhiy Storchaka wrote:
> Serhiy Storchaka added the comment:
>
> According to your profile results all re functions are 2.5-4 times faster 
> under 2.7.12 than under 2.7.6. May be I misinterpret it?
I can't explain the profiler's report.  I'm kind of glad that you, too, 
find it baffling.  Is it possible that the profiler doesn't actually 
work predictably in the multiprocessing context?  If so, one thing I can 
*easily* do is to disable multiprocessing in that code and see what the 
profiler reports are then.  It will take all night, but I'm beginning to 
think it would be worthwhile, because it might point the finger of blame 
at either the multiprocessing module or the re module, but not both at once.

(I originally provided a "disable multiprocessing" capability in that 
code in order to use the Python debugger with it.  It would kind of make 
sense if the profiler had limitations similar to those of the debugger.)
>
> Note that 96-99% of time (2847.099 of 2980.718 seconds under 2.7.6 and 
> 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The 
> rest of time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 
> 2.7.12 (4519.872 - 4474.890 = 44.982).
Yeah, I'm beginning to wonder if those strange statistics, too, are 
artifacts of using a single-process profiler in a multiprocessing context.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Steve Newcomb

Steve Newcomb added the comment:

On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you could run "hg bisect" with your set-up to isolate 
> the change that causes the problem.
I don't think I understand you.  There's no difference in the Python 
code we're using in both cases.  The only differences, AFAIK, are in the 
Python interpreter and in the Linux distribution.  I'm not qualified to 
analyze the differences in the latter items.
>Alternatively, make a small set of regular expressions that demonstrate 
> the performance regression.
It will be hard to do that, because the code is so complex, and because 
debugging in the multiprocessing context is so hairy. Still, it's the 
only approach I can think of, too.  Sigh.  I'm thinking about it.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

According to your profile results all re functions are 2.5-4 times faster under 
2.7.12 than under 2.7.6. May be I misinterpret it?

Note that 96-99% of time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 
of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of time 
is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 
(4519.872 - 4474.890 = 44.982).

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Raymond Hettinger

Raymond Hettinger added the comment:

It would be helpful if you could run "hg bisect" with your set-up to isolate 
the change that causes the problem.  Alternatively, make a small set of regular 
expressions that demonstrate the performance regression.

--
nosy: +rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Steve Newcomb

New submission from Steve Newcomb:

Our most regular-expression-processing-intensive Python 2.7 code takes 2.5x 
more execution time in 2.7.12 than it did in 2.7.6.  I discovered this after 
upgrading from Ubuntu 14.04 to Ubuntu 16.04.  Basically this code runs 
thousands of compiled regular expressions on thousands of texts.  Both the 
multiprocessing module and the re module are heavily used.

See attached profiler outputs, which look quite different in several respects.  
I used the profiling module to profile the same Python code, processing the 
same data, using the same hardware, under both Ubuntu 14.04 (Python 2.7.6) and 
Ubuntu 16.04 (Python 2.7.12).  

It is striking, for example, that cPickle.load appears so prominently in the 
2.7.12 profile -- a fact which appears to implicate the multiprocessing module 
somehow.  But I suspect that the re module is more likely the main source of 
the problem, because the execution times of other production steps -- steps 
that do not call the multiprocessing module -- also appear to be extended to a 
degree that is roughly proportional to the amount of regular expression 
processing done in those other steps.

I will happily provide any further information I can.  Any insights about this 
surprisingly severe performance degradation would be welcome.

--
files: profiles_2.7.6_vs_2.7.12
messages: 273932
nosy: steve.newcomb
priority: normal
severity: normal
status: open
title: regexp performance degradation between 2.7.6 and 2.7.12
Added file: http://bugs.python.org/file44277/profiles_2.7.6_vs_2.7.12

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com