[issue11454] email.message import time

2013-06-26 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 520490c4c388 by R David Murray in branch 'default':
#11454: Reduce email module load time, improve surrogate check efficiency.
http://hg.python.org/cpython/rev/520490c4c388

--
nosy: +python-dev

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2013-06-26 Thread R. David Murray

R. David Murray added the comment:

I've checked in the encode version of the method.  I'm going to pass on doing 
the other inlines, given that the improvement isn't that large.  I will, 
however, keep the issue in mind as I make other changes to the code, and there 
will be a general performance review phase when I get done with the API 
additions/bug fixing in the email6 project.

--
resolution:  - fixed
stage: patch review - committed/rejected
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2013-03-14 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
stage:  - patch review
versions: +Python 3.4 -Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-23 Thread R. David Murray

R. David Murray added the comment:

I'm really not willing to inline any of those pre-compiled regular expressions. 
 They are precompiled because for a program processing lots of email, they are 
hot spots.  We could use the same compile on demand dodge on them, though.

Can you explain your changes to the

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-23 Thread R. David Murray

R. David Murray added the comment:

Woops.

Can you explain your changes to the ecre regex (keeping in mind that I don't 
know much about regex syntax).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-23 Thread R. David Murray

R. David Murray added the comment:

Oh, yeah, and the encode benchmark is very instructive, thanks Serhiy :)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-23 Thread Ezio Melotti

Ezio Melotti added the comment:

 They are precompiled because for a program processing lots of email,
 they are hot spots.

OK, I didn't know they were hot spots.  Note that the regex are not recompiled 
everytime: they are compiled the first time and then taken from the cache 
(assuming they don't fall out from the bottom of the cache).  This still has a 
small overhead though.

 Can you explain your changes to the ecre regex (keeping in mind
 that I don't know much about regex syntax).

-  (?Pcharset[^?]*?)   # non-greedy up to the next ? is the charset
+  (?Pcharset[^?]*)# up to the next ? is the charset
   \?# literal ?
   (?Pencoding[qb])# either a q or a b, case insensitive
   \?# literal ?
-  (?Pencoded.*?)  # non-greedy up to the next ?= is the encoded string
+  (?Pencoded[^?]*)# up to the next ?= is the encoded string
   \?=   # literal ?=

At the beginning, the non-greedy *? is unnecessary because [^?]* already stops 
at the first ? found.
The second change might actually be wrong if encoded is allowed to contain 
lone '?'s.  The original regex used '.*?\?=', which means match everything 
(including lone '?'s) until the first '?='), mine means match everything 
until the first '?' which works fine as long as lone '?'s are not allowed.

Serhiy's suggestion is semantically different, but it might be still suitable 
if having _has_surrogate return True even for surrogates not in range 
\udc80-\udcff is OK.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-23 Thread R. David Murray

R. David Murray added the comment:

Well, other surrogates will cause a different error later than with the 
current _has_surrogates logic, but it won't be any more mysterious than what 
would happen now, I think.  Normally, if I understand correctly, other 
surrogates should never occur, so I don't think it is a real issue.

Yes, lone '?'s should not stop the pattern match in an encoded string. Even 
though I don't think they are normally supposed to occur, they do occur when 
encoded words are encoded incorrectly, and we get a better error recovery 
result if we look for ?= as the end.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-20 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 If I change the regex to _has_surrogates = 
 re.compile('[\udc80-\udcff]').search, the tests still pass but there's no 
 improvement on startup time (note: the previous regex was matching all the 
 surrogates in this range too, however I'm not sure how well this is tested).

What about

  _has_surrogates = re.compile('[^\udc80-\udcff]*\Z').match

?

--
nosy: +storchaka

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti

Ezio Melotti added the comment:

 What about _has_surrogates = re.compile('[^\udc80-\udcff]*\Z').match ?

The runtime is a bit slower than re.compile('[\udc80-\udcff]').search, but 
otherwise it's faster than all the other alternatives.  I haven't checked the 
startup-time, but I suspect it won't be better -- maybe even worse.

--
Added file: http://bugs.python.org/file27223/issue11454_surr1.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 I haven't checked the startup-time, but I suspect it won't be better -- maybe 
 even worse.

I suppose it will be much better.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Startup-time:

$ ./python -m timeit  -s 'import re'  
're.compile(([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)).search;
 re.purge()'
100 loops, best of 3: 4.16 msec per loop
$ ./python -m timeit  -s 'import re'  're.purge()'  
're.compile([\udc80-\udcff]).search'
100 loops, best of 3: 5.72 msec per loop
$ ./python -m timeit  'h=lambda s, p=set(map(chr, range(0xDC80, 0xDCFF+1))): 
any(c in p for c in s)'
1 loops, best of 3: 60.5 usec per loop
$ ./python -m timeit  -s 'import re'  're.purge()'  
're.compile((?![^\udc80-\udcff])).search'
1000 loops, best of 3: 401 usec per loop
$ ./python -m timeit  -s 'import re'  're.purge()'  
're.compile([^\udc80-\udcff]*\Z).match'
1000 loops, best of 3: 427 usec per loop

Runtime:

$ ./python -m timeit  -s 'import re; 
h=re.compile(([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)).search;
 s = A*1000'  'h(s)'
1000 loops, best of 3: 245 usec per loop
$ ./python -m timeit  -s 'import re; h=re.compile([\udc80-\udcff]).search; s 
= A*1000'  'h(s)'
1 loops, best of 3: 30.1 usec per loop
$ ./python -m timeit  -s 'h=lambda s, p=set(map(chr, range(0xDC80, 0xDCFF+1))): 
any(c in p for c in s); s = A*1000'  'h(s)'
1 loops, best of 3: 164 usec per loop
$ ./python -m timeit  -s 'import re; 
h=re.compile((?![^\udc80-\udcff])).search; s = A*1000'  'h(s)'
1 loops, best of 3: 98.3 usec per loop
$ ./python -m timeit  -s 'import re; h=re.compile([^\udc80-\udcff]*\Z).match; 
s = A*1000'  'h(s)'
1 loops, best of 3: 34.6 usec per loop

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Faster set-version:

$ ./python -m timeit  -s 'h=lambda s, hn=set(map(chr, range(0xDC80, 
0xDD00))).isdisjoint: not hn(s); s = A*1000'  'h(s)'
1 loops, best of 3: 43.8 usec per loop

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


Removed file: http://bugs.python.org/file27223/issue11454_surr1.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


Removed file: http://bugs.python.org/file27203/issue11454_surr1.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti

Ezio Melotti added the comment:

Attached new benchmark file.

Results:

Testing runtime of the _has_surrogates functions
Generating chars...
Generating samples...
  1.61 - re.compile(current_regex).search
  0.24 - re.compile(short_regex).search
 15.13 - return any(c in surrogates for c in s)
 10.21 - for c in s: if c in surrogates: return True
  0.85 - return re.search(short_regex, s)
  0.83 - functools.partial(re.search, short_regex)
 20.86 - for c in map(ord, s): if c in range(0xDC80, 0xDCFF+1): return True
 19.68 - for c in map(ord, s): if 0xDC80 = c = 0xDCFF: return True
  0.28 - re.compile('[^\udc80-\udcff]*\Z').match
  7.00 - return not set(map(chr, range(0xDC80, 0xDCFF+1))).isdisjoint(s)

Testing startup time
  0.57 - r = re.compile('[\udc80-\udcff]').search
  0.59 - r = re.compile('[^\udc80-\udcff]*\Z').match
199.79 - r = re.compile('[\udc80-\udcff]').search; purge()
 22.62 - r = re.compile('[^\udc80-\udcff]*\Z').match; purge()
  1.12 - r = pickle.loads(p)

--
Added file: http://bugs.python.org/file27225/issue11454_benchmarks.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread R. David Murray

R. David Murray added the comment:

So by your measurements the short search is the clear winner?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti

Ezio Melotti added the comment:

Yes, however it has a startup cost that the function that returns 
re.search(short_regex, s) and the one with functool.partial don't have, because 
with these the compilation happens at the first call.

If we use one of these two, the startup time will be reduced a lot, and the 
runtime will be ~2x faster.
If we use re.compile(short_regex).search the startup time won't be reduced as 
much, but the runtime will be ~8x faster.

Given that here we are trying to reduce the startup time and not the runtime, I 
think using one of those two functions is better.

Another possible solution to improve the startup time is trying to optimize 
_optimize_unicode -- not sure how much can be done there though.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread R. David Murray

R. David Murray added the comment:

This issue may be about reducing the startup time, but this function is a hot 
spot in the email package so I would prefer to sacrifice startup time 
optimization for an increase in speed.

However, given the improvements to import locking in 3.3, what about a self 
replacing function?

def _has_surrogates(s):
import email.utils
f = re.compile('[\udc80-\udcff]').search
email.utils._has_surrogates = f
return f(s)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti

Ezio Melotti added the comment:

That might work.

To avoid the overhead of the cache lookup I was thinking about something like

regex = None
def _has_surrogates(s):
global regex
if regex is None:
regex = re.compile(short_regex)
return regex.search(s)

but I have discarded it because it's not very pretty and still has the overhead 
of the function and an additional if.  Your version solves both the problems in 
a more elegant way.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread R. David Murray

R. David Murray added the comment:

It passed the email test suite.  Patch attached.

--
Added file: http://bugs.python.org/file27226/email_import_speedup.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Ezio Melotti

Ezio Melotti added the comment:

It would be better to add/improve the _has_surrogates tests before committing.
The patch I attached is also still valid if you want a further speed up 
improvement.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

def _has_surrogates(s):
try:
s.encode()
return False
except UnicodeEncodeError:
return True

Results:
0.26 - re.compile(short_regex).search
0.06 - try encode

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-18 Thread Ezio Melotti

Ezio Melotti added the comment:

re.compile seems twice as fast as pickle.loads:


import re
import pickle
import timeit

N = 10
s = r = re.compile('[\\udc80-\\udcff]')
t = timeit.Timer(s, 'import re')
print(%6.2f - re.compile % t.timeit(number=N))

s = r = pickle.loads(p)
p = pickle.dumps(re.compile('[\udc80-\udcff]'))
t = timeit.Timer(s, 'import pickle; from __main__ import p')
print(%6.2f - pickle.loads % t.timeit(number=N))


Result:
  5.59 - re.compile
 11.04 - pickle.loads


See also #2679.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-16 Thread R. David Murray

R. David Murray added the comment:

Considering how often that test is done, I would consider the compiled version 
of the short regex the clear winner based on your numbers.  I wonder if we 
could precompile the regex and load it from a pickle.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-15 Thread Ezio Melotti

Ezio Melotti added the comment:

I tried to remove a few unused regex and inline some of the others (the re 
module has its own caching anyway and they don't seem to be documented), but it 
didn't get so much faster (see attached patch).  

I then put the second list of email imports of the previous message in a file 
and run it with cprofile and these are the results:

=== Without patch ===

$ time ./python -m issue11454_imp2
[69308 refs]

real0m0.337s
user0m0.312s
sys 0m0.020s

$ ./python -m cProfile -s time issue11454_imp2.py
 15130 function calls (14543 primitive calls) in 0.191 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   260.0290.0010.0290.001 {built-in method loads}
 12480.0150.0000.0180.000 sre_parse.py:184(__next)
30.0100.0030.0150.005 
sre_compile.py:301(_optimize_unicode)
48/170.0090.0000.0370.002 sre_parse.py:418(_parse)
 30/10.0080.0000.1910.191 {built-in method exec}
   820.0070.0000.0240.000 {built-in method __build_class__}
   250.0060.0000.0240.001 
sre_compile.py:207(_optimize_charset)
80.0050.0010.0050.001 {built-in method load_dynamic}
 11220.0050.0000.0220.000 sre_parse.py:209(get)
  1770.0050.0000.0050.000 {built-in method stat}
  1070.0050.0000.0120.000 frozen 
importlib._bootstrap:1350(find_loader)
2944/29190.0040.0000.0040.000 {built-in method len}
69/150.0030.0000.0280.002 sre_compile.py:32(_compile)
90.0030.0000.0030.000 sre_compile.py:258(_mk_bitmap)
   940.0020.0000.0030.000 frozen 
importlib._bootstrap:74(_path_join)


=== With patch ===

$ time ./python -m issue11454_imp2
[69117 refs]

real0m0.319s
user0m0.304s
sys 0m0.012s

$ ./python -m cProfile -s time issue11454_imp2.py
 11281 function calls (10762 primitive calls) in 0.162 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   210.0220.0010.0220.001 {built-in method loads}
30.0110.0040.0150.005 
sre_compile.py:301(_optimize_unicode)
  7080.0080.0000.0100.000 sre_parse.py:184(__next)
 30/10.0080.0000.2380.238 {built-in method exec}
   820.0070.0000.0230.000 {built-in method __build_class__}
  1870.0050.0000.0050.000 {built-in method stat}
80.0050.0010.0050.001 {built-in method load_dynamic}
  1070.0050.0000.0120.000 frozen 
importlib._bootstrap:1350(find_loader)
 29/80.0050.0000.0200.002 sre_parse.py:418(_parse)
   110.0040.0000.0200.002 
sre_compile.py:207(_optimize_charset)
  6430.0030.0000.0120.000 sre_parse.py:209(get)
50.0030.0010.0030.001 {built-in method dumps}
   940.0020.0000.0030.000 frozen 
importlib._bootstrap:74(_path_join)
  2570.0020.0000.0020.000 quoprimime.py:56(genexpr)
   260.0020.0000.1160.004 frozen 
importlib._bootstrap:938(get_code)
1689/16760.0020.0000.0020.000 {built-in method len}
   310.0020.0000.0030.000 frozen 
importlib._bootstrap:1034(get_data)
  2560.0020.0000.0020.000 {method 'setdefault' of 'dict' 
objects}
  1190.0020.0000.0030.000 frozen 
importlib._bootstrap:86(_path_split)
   350.0020.0000.0190.001 frozen 
importlib._bootstrap:1468(_find_module)
   340.0020.0000.0150.000 frozen 
importlib._bootstrap:1278(_get_loader)
 39/60.0020.0000.0230.004 sre_compile.py:32(_compile)
 26/30.0010.0000.2350.078 frozen 
importlib._bootstrap:853(_load_module)


The time spent in sre_compile.py:301(_optimize_unicode) most likely comes from 
email.utils._has_surrogates (there's a further speedup when it's commented 
away):
_has_surrogates = 
re.compile('([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search

This is used in a number of places, so it can't be inlined.  I wanted to 
optimize it but I'm not sure what it's supposed to do.  It matches lone low 
surrogates, but not lone high ones, and matches some invalid sequences, but not 
others:
 _has_surrogates('\ud800')  # lone high
 _has_surrogates('\udc00')  # lone low
_sre.SRE_Match object at 0x9ae00e8
 _has_surrogates('\ud800\udc00')  # valid pair (high+low)
 _has_surrogates('\ud800\ud800\udc00')  # invalid sequence (lone high, valid 
 high+low)
 _has_surrogates('\udc00\ud800\ud800\udc00')  # invalid sequence (lone low, 
 lone high, valid high+low)
_sre.SRE_Match object at 0x9ae0028

FWIW this was 

[issue11454] email.message import time

2012-09-15 Thread R. David Murray

R. David Murray added the comment:

It detects whether a string contains any characters have been surrogate escaped 
by the surrogate escape handler.  I disliked using it, but I didn't know of any 
better way to do that detection.  It's on my long list of things to come back 
to eventually and try to improve :)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2012-09-15 Thread Ezio Melotti

Ezio Melotti added the comment:

Given that high surrogates are U+D800..U+DBFF, and low ones are U+DC00..U+DFFF, 
'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)' means a low 
surrogates, preceded by either an high one or line beginning, and followed by 
another low one or line end.

PEP 838 says With this PEP, non-decodable bytes = 128 will be represented as 
lone surrogate codes U+DC80..U+DCFF.

If I change the regex to _has_surrogates = 
re.compile('[\udc80-\udcff]').search, the tests still pass but there's no 
improvement on startup time (note: the previous regex was matching all the 
surrogates in this range too, however I'm not sure how well this is tested).

If I change the implementation with
_pep383_surrogates = set(map(chr, range(0xDC80, 0xDCFF+1)))
def _has_surrogates(s):
return any(c in _pep383_surrogates for c in s)

the tests still pass and the startup is ~15ms faster here:

$ time ./python -m issue11454_imp2
[68837 refs]

real0m0.305s
user0m0.288s
sys 0m0.012s

However using this function instead of the regex is ~10x slower at runtime.  
Using the shorter regex is about ~7x faster, but there are no improvements on 
the startup time.
Assuming the shorter regex is correct, it can still be called inside a function 
or used with functools.partial.  This will result in a improved startup time 
and a ~2x improvement on runtime (so it's a win-win).
See attached patch for benchmarks.

This is a sample result:
 17.01 usec/pass  - re.compile(current_regex).search
  2.20 usec/pass  - re.compile(short_regex).search
148.18 usec/pass  - return any(c in surrogates for c in s)
106.35 usec/pass  - for c in s: if c in surrogates: return True
  8.40 usec/pass  - return re.search(short_regex, s)
  8.20 usec/pass  - functools.partial(re.search, short_regex)

--
Added file: http://bugs.python.org/file27203/issue11454_surr1.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11454] email.message import time

2011-03-11 Thread Ross Lagerwall

Changes by Ross Lagerwall rosslagerw...@gmail.com:


--
title: urllib.request import time - email.message import time

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11454
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com