[issue11454] email.message import time
Roundup Robot added the comment: New changeset 520490c4c388 by R David Murray in branch 'default': #11454: Reduce email module load time, improve surrogate check efficiency. http://hg.python.org/cpython/rev/520490c4c388 -- nosy: +python-dev ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue11454 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11454] email.message import time
R. David Murray added the comment:

I've checked in the encode version of the method. I'm going to pass on doing the other inlines, given that the improvement isn't that large. I will, however, keep the issue in mind as I make other changes to the code, and there will be a general performance review phase when I get done with the API additions/bug fixing in the email6 project.

--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed
[issue11454] email.message import time
Changes by Ezio Melotti <ezio.melo...@gmail.com>:

--
stage:  -> patch review
versions: +Python 3.4 -Python 3.3
[issue11454] email.message import time
R. David Murray added the comment:

I'm really not willing to inline any of those pre-compiled regular expressions. They are precompiled because, for a program processing lots of email, they are hot spots. We could use the same compile-on-demand dodge on them, though.

Can you explain your changes to the

--
[issue11454] email.message import time
R. David Murray added the comment:

Woops. Can you explain your changes to the ecre regex (keeping in mind that I don't know much about regex syntax).

--
[issue11454] email.message import time
R. David Murray added the comment:

Oh, yeah, and the encode benchmark is very instructive, thanks Serhiy :)

--
[issue11454] email.message import time
Ezio Melotti added the comment:

> They are precompiled because for a program processing lots of email, they are hot spots.

OK, I didn't know they were hot spots. Note that the regexes are not recompiled every time: they are compiled the first time and then taken from the cache (assuming they don't fall out from the bottom of the cache). This still has a small overhead though.

> Can you explain your changes to the ecre regex (keeping in mind that I don't know much about regex syntax).

- (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
+ (?P<charset>[^?]*)    # up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
- (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
+ (?P<encoded>[^?]*)    # up to the next ?= is the encoded string
  \?=                   # literal ?=

At the beginning, the non-greedy *? is unnecessary because [^?]* already stops at the first ? found. The second change might actually be wrong if "encoded" is allowed to contain lone '?'s. The original regex used '.*?\?=', which means "match everything (including lone '?'s) until the first '?='"; mine means "match everything until the first '?'", which works fine as long as lone '?'s are not allowed.

Serhiy's suggestion is semantically different, but it might still be suitable if having _has_surrogates return True even for surrogates not in range \udc80-\udcff is OK.

--
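To make the greedy vs. non-greedy discussion concrete, here is a runnable sketch. The two patterns are reduced stand-ins for the ecre variants being compared (the flags and the test strings are invented for illustration, not taken from the email package):

```python
import re

# Reduced versions of the two ecre variants discussed above.
original = re.compile(
    r'=\?(?P<charset>[^?]*?)\?(?P<encoding>[qb])\?(?P<encoded>.*?)\?=',
    re.IGNORECASE)
proposed = re.compile(
    r'=\?(?P<charset>[^?]*)\?(?P<encoding>[qb])\?(?P<encoded>[^?]*)\?=',
    re.IGNORECASE)

# On a well-formed encoded word both behave identically.
well_formed = '=?utf-8?q?hello?='
print(original.search(well_formed).group('encoded'))  # hello
print(proposed.search(well_formed).group('encoded'))  # hello

# On a payload containing a lone '?', only the non-greedy '.*?' version
# still reaches the terminating '?=' -- the '[^?]*' version finds nothing.
broken = '=?utf-8?q?a?bc?='
print(original.search(broken).group('encoded'))  # a?bc
print(proposed.search(broken))                   # None
```

This is exactly the error-recovery case discussed below: a lone '?' inside the encoded text defeats the `[^?]*` variant.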
[issue11454] email.message import time
R. David Murray added the comment:

Well, other surrogates will cause a different error later than with the current _has_surrogates logic, but it won't be any more mysterious than what would happen now, I think. Normally, if I understand correctly, other surrogates should never occur, so I don't think it is a real issue.

Yes, lone '?'s should not stop the pattern match in an encoded string. Even though I don't think they are normally supposed to occur, they do occur when encoded words are encoded incorrectly, and we get a better error recovery result if we look for ?= as the end.

--
[issue11454] email.message import time
Changes by Arfrever Frehtes Taifersar Arahesis <arfrever@gmail.com>:

--
nosy: +Arfrever
[issue11454] email.message import time
Serhiy Storchaka added the comment:

> If I change the regex to _has_surrogates = re.compile('[\udc80-\udcff]').search, the tests still pass but there's no improvement on startup time (note: the previous regex was matching all the surrogates in this range too, however I'm not sure how well this is tested).

What about _has_surrogates = re.compile('[^\udc80-\udcff]*\Z').match ?

--
nosy: +storchaka
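One subtlety worth noting about this proposal, shown in a small runnable sketch (the variable names and sample strings are mine): the complement pattern matches only when the whole string is free of the PEP 383 range, so a truthy result means "no surrogates", and callers would need to negate it relative to the search-based version.

```python
import re

# The proposed pattern matches the entire string only when it contains no
# character in U+DC80..U+DCFF, so a truthy result means "clean".
no_surrogates = re.compile('[^\udc80-\udcff]*\\Z').match

clean = 'plain ascii'
dirty = b'caf\xe9'.decode('ascii', 'surrogateescape')  # -> 'caf\udce9'

print(bool(no_surrogates(clean)))  # True  -> no surrogates present
print(bool(no_surrogates(dirty)))  # False -> a surrogate was found
```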
[issue11454] email.message import time
Ezio Melotti added the comment:

> What about _has_surrogates = re.compile('[^\udc80-\udcff]*\Z').match ?

The runtime is a bit slower than re.compile('[\udc80-\udcff]').search, but otherwise it's faster than all the other alternatives. I haven't checked the startup time, but I suspect it won't be better -- maybe even worse.

--
Added file: http://bugs.python.org/file27223/issue11454_surr1.py
[issue11454] email.message import time
Serhiy Storchaka added the comment:

> I haven't checked the startup time, but I suspect it won't be better -- maybe even worse.

I suppose it will be much better.

--
[issue11454] email.message import time
Serhiy Storchaka added the comment:

Startup-time:

$ ./python -m timeit -s 'import re' 're.compile("([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)").search; re.purge()'
100 loops, best of 3: 4.16 msec per loop
$ ./python -m timeit -s 'import re' 're.purge()' 're.compile("[\udc80-\udcff]").search'
100 loops, best of 3: 5.72 msec per loop
$ ./python -m timeit 'h=lambda s, p=set(map(chr, range(0xDC80, 0xDCFF+1))): any(c in p for c in s)'
1 loops, best of 3: 60.5 usec per loop
$ ./python -m timeit -s 'import re' 're.purge()' 're.compile("(?![^\udc80-\udcff])").search'
1000 loops, best of 3: 401 usec per loop
$ ./python -m timeit -s 'import re' 're.purge()' 're.compile("[^\udc80-\udcff]*\Z").match'
1000 loops, best of 3: 427 usec per loop

Runtime:

$ ./python -m timeit -s 'import re; h=re.compile("([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)").search; s = "A"*1000' 'h(s)'
1000 loops, best of 3: 245 usec per loop
$ ./python -m timeit -s 'import re; h=re.compile("[\udc80-\udcff]").search; s = "A"*1000' 'h(s)'
1 loops, best of 3: 30.1 usec per loop
$ ./python -m timeit -s 'h=lambda s, p=set(map(chr, range(0xDC80, 0xDCFF+1))): any(c in p for c in s); s = "A"*1000' 'h(s)'
1 loops, best of 3: 164 usec per loop
$ ./python -m timeit -s 'import re; h=re.compile("(?![^\udc80-\udcff])").search; s = "A"*1000' 'h(s)'
1 loops, best of 3: 98.3 usec per loop
$ ./python -m timeit -s 'import re; h=re.compile("[^\udc80-\udcff]*\Z").match; s = "A"*1000' 'h(s)'
1 loops, best of 3: 34.6 usec per loop

--
[issue11454] email.message import time
Serhiy Storchaka added the comment:

Faster set-version:

$ ./python -m timeit -s 'h=lambda s, hn=set(map(chr, range(0xDC80, 0xDD00))).isdisjoint: not hn(s); s = "A"*1000' 'h(s)'
1 loops, best of 3: 43.8 usec per loop

--
[issue11454] email.message import time
Changes by Ezio Melotti <ezio.melo...@gmail.com>:

Removed file: http://bugs.python.org/file27223/issue11454_surr1.py
[issue11454] email.message import time
Changes by Ezio Melotti <ezio.melo...@gmail.com>:

Removed file: http://bugs.python.org/file27203/issue11454_surr1.py
[issue11454] email.message import time
Ezio Melotti added the comment:

Attached new benchmark file. Results:

Testing runtime of the _has_surrogates functions
Generating chars...
Generating samples...
  1.61 - re.compile(current_regex).search
  0.24 - re.compile(short_regex).search
 15.13 - return any(c in surrogates for c in s)
 10.21 - for c in s: if c in surrogates: return True
  0.85 - return re.search(short_regex, s)
  0.83 - functools.partial(re.search, short_regex)
 20.86 - for c in map(ord, s): if c in range(0xDC80, 0xDCFF+1): return True
 19.68 - for c in map(ord, s): if 0xDC80 <= c <= 0xDCFF: return True
  0.28 - re.compile('[^\udc80-\udcff]*\Z').match
  7.00 - return not set(map(chr, range(0xDC80, 0xDCFF+1))).isdisjoint(s)

Testing startup time
  0.57 - r = re.compile('[\udc80-\udcff]').search
  0.59 - r = re.compile('[^\udc80-\udcff]*\Z').match
199.79 - r = re.compile('[\udc80-\udcff]').search; purge()
 22.62 - r = re.compile('[^\udc80-\udcff]*\Z').match; purge()
  1.12 - r = pickle.loads(p)

--
Added file: http://bugs.python.org/file27225/issue11454_benchmarks.py
[issue11454] email.message import time
R. David Murray added the comment:

So by your measurements the short search is the clear winner?

--
[issue11454] email.message import time
Ezio Melotti added the comment:

Yes, however it has a startup cost that the function that returns re.search(short_regex, s) and the one with functools.partial don't have, because with these the compilation happens at the first call. If we use one of these two, the startup time will be reduced a lot, and the runtime will be ~2x faster. If we use re.compile(short_regex).search the startup time won't be reduced as much, but the runtime will be ~8x faster. Given that here we are trying to reduce the startup time and not the runtime, I think using one of those two functions is better.

Another possible solution to improve the startup time is trying to optimize _optimize_unicode -- not sure how much can be done there though.

--
[issue11454] email.message import time
R. David Murray added the comment:

This issue may be about reducing the startup time, but this function is a hot spot in the email package, so I would prefer to sacrifice startup time optimization for an increase in speed. However, given the improvements to import locking in 3.3, what about a self-replacing function?

def _has_surrogates(s):
    import email.utils
    f = re.compile('[\udc80-\udcff]').search
    email.utils._has_surrogates = f
    return f(s)

--
[issue11454] email.message import time
Ezio Melotti added the comment:

That might work. To avoid the overhead of the cache lookup I was thinking about something like

regex = None

def _has_surrogates(s):
    global regex
    if regex is None:
        regex = re.compile(short_regex)
    return regex.search(s)

but I have discarded it because it's not very pretty and still has the overhead of the function call and an additional if. Your version solves both problems in a more elegant way.

--
[issue11454] email.message import time
R. David Murray added the comment:

It passed the email test suite. Patch attached.

--
Added file: http://bugs.python.org/file27226/email_import_speedup.patch
[issue11454] email.message import time
Ezio Melotti added the comment:

It would be better to add/improve the _has_surrogates tests before committing. The patch I attached is also still valid if you want a further speed-up.

--
[issue11454] email.message import time
Serhiy Storchaka added the comment:

def _has_surrogates(s):
    try:
        s.encode()
        return False
    except UnicodeEncodeError:
        return True

Results:
  0.26 - re.compile(short_regex).search
  0.06 - try encode

--
[issue11454] email.message import time
Ezio Melotti added the comment:

re.compile seems twice as fast as pickle.loads:

import re
import pickle
import timeit

N = 10

s = "r = re.compile('[\\udc80-\\udcff]')"
t = timeit.Timer(s, 'import re')
print("%6.2f - re.compile" % t.timeit(number=N))

s = "r = pickle.loads(p)"
p = pickle.dumps(re.compile('[\udc80-\udcff]'))
t = timeit.Timer(s, 'import pickle; from __main__ import p')
print("%6.2f - pickle.loads" % t.timeit(number=N))

Result:
  5.59 - re.compile
 11.04 - pickle.loads

See also #2679.

--
[issue11454] email.message import time
R. David Murray added the comment:

Considering how often that test is done, I would consider the compiled version of the short regex the clear winner based on your numbers. I wonder if we could precompile the regex and load it from a pickle.

--
[issue11454] email.message import time
Ezio Melotti added the comment:

I tried to remove a few unused regexes and inline some of the others (the re module has its own caching anyway and they don't seem to be documented), but it didn't get so much faster (see attached patch). I then put the second list of email imports of the previous message in a file and ran it with cProfile, and these are the results:

=== Without patch ===

$ time ./python -m issue11454_imp2
[69308 refs]

real    0m0.337s
user    0m0.312s
sys     0m0.020s

$ ./python -m cProfile -s time issue11454_imp2.py
         15130 function calls (14543 primitive calls) in 0.191 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       26    0.029    0.001    0.029    0.001 {built-in method loads}
     1248    0.015    0.000    0.018    0.000 sre_parse.py:184(__next)
        3    0.010    0.003    0.015    0.005 sre_compile.py:301(_optimize_unicode)
    48/17    0.009    0.000    0.037    0.002 sre_parse.py:418(_parse)
     30/1    0.008    0.000    0.191    0.191 {built-in method exec}
       82    0.007    0.000    0.024    0.000 {built-in method __build_class__}
       25    0.006    0.000    0.024    0.001 sre_compile.py:207(_optimize_charset)
        8    0.005    0.001    0.005    0.001 {built-in method load_dynamic}
     1122    0.005    0.000    0.022    0.000 sre_parse.py:209(get)
      177    0.005    0.000    0.005    0.000 {built-in method stat}
      107    0.005    0.000    0.012    0.000 <frozen importlib._bootstrap>:1350(find_loader)
2944/2919    0.004    0.000    0.004    0.000 {built-in method len}
    69/15    0.003    0.000    0.028    0.002 sre_compile.py:32(_compile)
        9    0.003    0.000    0.003    0.000 sre_compile.py:258(_mk_bitmap)
       94    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:74(_path_join)

=== With patch ===

$ time ./python -m issue11454_imp2
[69117 refs]

real    0m0.319s
user    0m0.304s
sys     0m0.012s

$ ./python -m cProfile -s time issue11454_imp2.py
         11281 function calls (10762 primitive calls) in 0.162 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       21    0.022    0.001    0.022    0.001 {built-in method loads}
        3    0.011    0.004    0.015    0.005 sre_compile.py:301(_optimize_unicode)
      708    0.008    0.000    0.010    0.000 sre_parse.py:184(__next)
     30/1    0.008    0.000    0.238    0.238 {built-in method exec}
       82    0.007    0.000    0.023    0.000 {built-in method __build_class__}
      187    0.005    0.000    0.005    0.000 {built-in method stat}
        8    0.005    0.001    0.005    0.001 {built-in method load_dynamic}
      107    0.005    0.000    0.012    0.000 <frozen importlib._bootstrap>:1350(find_loader)
     29/8    0.005    0.000    0.020    0.002 sre_parse.py:418(_parse)
       11    0.004    0.000    0.020    0.002 sre_compile.py:207(_optimize_charset)
      643    0.003    0.000    0.012    0.000 sre_parse.py:209(get)
        5    0.003    0.001    0.003    0.001 {built-in method dumps}
       94    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:74(_path_join)
      257    0.002    0.000    0.002    0.000 quoprimime.py:56(<genexpr>)
       26    0.002    0.000    0.116    0.004 <frozen importlib._bootstrap>:938(get_code)
1689/1676    0.002    0.000    0.002    0.000 {built-in method len}
       31    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:1034(get_data)
      256    0.002    0.000    0.002    0.000 {method 'setdefault' of 'dict' objects}
      119    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:86(_path_split)
       35    0.002    0.000    0.019    0.001 <frozen importlib._bootstrap>:1468(_find_module)
       34    0.002    0.000    0.015    0.000 <frozen importlib._bootstrap>:1278(_get_loader)
     39/6    0.002    0.000    0.023    0.004 sre_compile.py:32(_compile)
     26/3    0.001    0.000    0.235    0.078 <frozen importlib._bootstrap>:853(_load_module)

The time spent in sre_compile.py:301(_optimize_unicode) most likely comes from email.utils._has_surrogates (there's a further speedup when it's commented away):

_has_surrogates = re.compile('([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search

This is used in a number of places, so it can't be inlined. I wanted to optimize it but I'm not sure what it's supposed to do. It matches lone low surrogates, but not lone high ones, and matches some invalid sequences, but not others:

>>> _has_surrogates('\ud800')  # lone high
>>> _has_surrogates('\udc00')  # lone low
<_sre.SRE_Match object at 0x9ae00e8>
>>> _has_surrogates('\ud800\udc00')  # valid pair (high+low)
>>> _has_surrogates('\ud800\ud800\udc00')  # invalid sequence (lone high, valid high+low)
>>> _has_surrogates('\udc00\ud800\ud800\udc00')  # invalid sequence (lone low, lone high, valid high+low)
<_sre.SRE_Match object at 0x9ae0028>

FWIW this was
[issue11454] email.message import time
R. David Murray added the comment:

It detects whether a string contains any characters that have been surrogate escaped by the surrogateescape handler. I disliked using it, but I didn't know of any better way to do that detection. It's on my long list of things to come back to eventually and try to improve :)

--
[issue11454] email.message import time
Ezio Melotti added the comment:

Given that high surrogates are U+D800..U+DBFF, and low ones are U+DC00..U+DFFF,

'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)'

means "a low surrogate, preceded by either a non-high surrogate or the beginning of the string, and followed by either a non-low surrogate or the end of the string".

PEP 383 says: "With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF."

If I change the regex to _has_surrogates = re.compile('[\udc80-\udcff]').search, the tests still pass but there's no improvement on startup time (note: the previous regex was matching all the surrogates in this range too, however I'm not sure how well this is tested).

If I change the implementation to

_pep383_surrogates = set(map(chr, range(0xDC80, 0xDCFF+1)))

def _has_surrogates(s):
    return any(c in _pep383_surrogates for c in s)

the tests still pass and the startup is ~15ms faster here:

$ time ./python -m issue11454_imp2
[68837 refs]

real    0m0.305s
user    0m0.288s
sys     0m0.012s

However, using this function instead of the regex is ~10x slower at runtime. Using the shorter regex is about ~7x faster, but there are no improvements on the startup time. Assuming the shorter regex is correct, it can still be called inside a function or used with functools.partial. This will result in an improved startup time and a ~2x improvement on runtime (so it's a win-win). See attached patch for benchmarks. This is a sample result:

 17.01 usec/pass - re.compile(current_regex).search
  2.20 usec/pass - re.compile(short_regex).search
148.18 usec/pass - return any(c in surrogates for c in s)
106.35 usec/pass - for c in s: if c in surrogates: return True
  8.40 usec/pass - return re.search(short_regex, s)
  8.20 usec/pass - functools.partial(re.search, short_regex)

--
Added file: http://bugs.python.org/file27203/issue11454_surr1.py
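A quick runnable illustration of the PEP 383 behavior this message relies on (the sample bytes are invented): undecodable bytes >= 128 come back as lone surrogates in U+DC80..U+DCFF, and they round-trip on encode with the same error handler.

```python
# \xe9 is not valid UTF-8 here (no continuation byte) and \xff never is,
# so surrogateescape maps them to U+DCE9 and U+DCFF respectively.
raw = b'caf\xe9 \xff'
s = raw.decode('utf-8', 'surrogateescape')

print([hex(ord(c)) for c in s if '\udc80' <= c <= '\udcff'])  # ['0xdce9', '0xdcff']
print(s.encode('utf-8', 'surrogateescape') == raw)            # True: lossless round-trip
```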
[issue11454] email.message import time
Changes by Ross Lagerwall <rosslagerw...@gmail.com>:

--
title: urllib.request import time -> email.message import time