Hi,

This is the second email thread I've started about implementing an opcode cache in the ceval loop. Since my first post on this topic:

- I've implemented another optimization (LOAD_ATTR);

- I've added a detailed statistics mode so that I can "see" how the cache performs and tune it;

- some macro benchmarks are now 10-20% faster; 2to3 (a real application) is 7-8% faster;

- and I have some good insights on the memory footprint.

** The purpose of this email is to get general approval from python-dev, so that I can start polishing the patches and getting them reviewed and committed. **


Summary of optimizations
------------------------

When a code object is executed more than ~1000 times, it's considered "hot". It gets its opcodes analyzed to initialize caches for LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and LOAD_GLOBAL.

It's important to only optimize code objects that were executed "enough" times, to avoid optimizing code objects for modules, classes, and functions that were imported but never used.
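To make the mechanism concrete, here is a minimal C sketch of a counter-driven check of this kind; the names and the exact bookkeeping are my own illustration, not the patch:

    #define OPCACHE_MIN_RUNS 1000   /* ~1000 executions before we optimize */

    typedef struct {
        unsigned int co_run_count;   /* illustrative: bumped on each execution */
        int co_opt_initialized;      /* illustrative: caches allocated? */
    } HotnessState;

    /* Called when the eval loop starts executing a code object. */
    static int
    maybe_init_opcache(HotnessState *st)
    {
        if (st->co_opt_initialized)
            return 1;                        /* already hot and initialized */
        if (++st->co_run_count < OPCACHE_MIN_RUNS)
            return 0;                        /* not hot yet: run unoptimized */
        /* Hot: scan the bytecode here and allocate caches for
           LOAD_ATTR / LOAD_GLOBAL / LOAD_METHOD. */
        st->co_opt_initialized = 1;
        return 1;
    }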

The cache struct is defined in code.h [2], and is 32 bytes long. When a code object becomes hot, it gets a cache offset table allocated for it (one byte per opcode), plus an array of cache structs.
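As a rough illustration of that layout (the real definition is in code.h [2]; the field names and the exact split of the 32 bytes are my guesses, not the patch):

    #include <stdint.h>

    typedef struct {
        void    *ptr;          /* cached value or descriptor */
        uint64_t version_a;    /* a guard, e.g. type->tp_version_tag */
        uint64_t version_b;    /* a second guard, e.g. a dict version */
        uint64_t reserved;     /* pads the entry to 32 bytes */
    } OpCacheEntry;            /* sizeof == 32 on 64-bit builds */

    typedef struct {
        uint8_t      *offset_table;  /* one byte per opcode; 0 = not cached */
        OpCacheEntry *entries;       /* one entry per optimized instruction */
    } OpCache;

    /* Finding an instruction's cache entry is just two array loads: */
    static OpCacheEntry *
    opcache_get(OpCache *c, int instr_index)
    {
        uint8_t slot = c->offset_table[instr_index];
        return slot ? &c->entries[slot - 1] : NULL;
    }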

To measure the worst-case memory impact, I tuned my code to optimize *every* code object on its *first* run, and then ran the entire Python test suite. The test suite plus the standard library contain around 72395 code objects, which required 20 MB of memory for caches. The test process consumed around 400 MB of memory overall, so in the absolute worst-case scenario the overhead is about 5%.

Then I ran the test suite with the patch unmodified, meaning that only code objects called frequently enough were optimized. In this mode, only 2072 code objects were optimized, using less than 1 MB of memory for caches.


LOAD_ATTR
---------

Damien George mentioned that MicroPython optimizes a lot of dict lookups by memoizing the offset of the last key/value pair found in the dict object, thus eliminating many hash lookups. I've implemented this optimization in my patch (a sketch follows at the end of this section), and the results are quite good: a simple micro-benchmark [3] shows a ~30% speed improvement. Here are some debug stats generated by the 2to3 benchmark:

-- Opcode cache LOAD_ATTR hits     = 14778415 (83%)
-- Opcode cache LOAD_ATTR misses   = 750 (0%)
-- Opcode cache LOAD_ATTR opts     = 282
-- Opcode cache LOAD_ATTR deopts   = 60
-- Opcode cache LOAD_ATTR total    = 17777912

Each "hit" makes LOAD_ATTR about 30% faster.


LOAD_GLOBAL
-----------

This turned out to be a very stable optimization. Here is the debug output of the 2to3 test:

-- Opcode cache LOAD_GLOBAL hits   = 3940647 (100%)
-- Opcode cache LOAD_GLOBAL misses = 0 (0%)
-- Opcode cache LOAD_GLOBAL opts   = 252

All benchmarks (and real code) show stats like these. Globals and builtins are very rarely modified, so the cache works really well. With the LOAD_GLOBAL opcode cache, a global lookup is very cheap: there is no hash lookup for it at all. It makes optimizations like "def foo(len=len)" obsolete.
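For illustration, here is a minimal C sketch of what a LOAD_GLOBAL cache hit could look like, assuming PEP 509-style dict version tags; the names are mine, not the patch's:

    #include <stdint.h>

    typedef struct {
        uint64_t ma_version;   /* bumped on every mutation (PEP 509) */
        /* ... the rest of the dict ... */
    } VersionedDict;

    typedef struct {
        void    *value;        /* cached pointer to the global */
        uint64_t globals_ver;  /* globals dict version at fill time */
        uint64_t builtins_ver; /* builtins dict version at fill time */
    } LoadGlobalCache;

    static void *
    load_global_fast(LoadGlobalCache *c, VersionedDict *globals,
                     VersionedDict *builtins)
    {
        if (c->value != NULL &&
            c->globals_ver == globals->ma_version &&
            c->builtins_ver == builtins->ma_version)
        {
            return c->value;   /* hit: no hash lookup at all */
        }
        return NULL;           /* miss: do a normal lookup and refill */
    }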


LOAD_METHOD
-----------

This is a new opcode I propose to add in [1]. The idea is to replace LOAD_ATTR with it for method calls, avoiding the instantiation of BoundMethod objects.

With the cache, we can store a reference to the method descriptor (I use type->tp_version_tag for cache invalidation, the same mechanism _PyType_Lookup is built around).

The cache makes LOAD_METHOD really efficient. A simple micro-benchmark [4] shows that with the cache and LOAD_METHOD, "s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".

LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than LOAD_ATTR/CALL_FUNCTION. With the cache, it's about 30% faster.
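Here is a minimal C sketch of the cache-hit path as described above, with toy stand-ins for the object structs (names are mine): on a hit, the function and `self` are pushed separately, and no bound method object is ever created.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t tp_version_tag; } ToyType;
    typedef struct { ToyType *ob_type; } ToyObject;

    typedef struct {
        ToyObject *func;      /* cached descriptor (the plain function) */
        uint64_t   type_ver;  /* tp_version_tag when the cache was filled */
    } LoadMethodCache;

    /* Returns the function to call, or NULL on a miss (fall back to
       _PyType_Lookup and refill the cache). */
    static ToyObject *
    load_method_fast(LoadMethodCache *c, ToyObject *self)
    {
        if (c->func != NULL &&
            c->type_ver == self->ob_type->tp_version_tag)
        {
            return c->func;   /* hit: caller pushes func and self */
        }
        return NULL;
    }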

Here's the debug output of the 2to3 benchmark:

-- Opcode cache LOAD_METHOD hits   = 5164848 (64%)
-- Opcode cache LOAD_METHOD misses = 12 (0%)
-- Opcode cache LOAD_METHOD opts   = 94
-- Opcode cache LOAD_METHOD deopts = 12
-- Opcode cache LOAD_METHOD dct-chk= 1614801
-- Opcode cache LOAD_METHOD total  = 7945954


What's next?
------------

First, I'd like to merge the new LOAD_METHOD opcode; see issue 26110 [1]. It's a very straightforward optimization, and the patch is small and easy to review.

Second, I'd like to merge the new opcode cache; see issue 26219 [5]. All unit tests pass. The memory usage increase is very moderate (<1 MB for the entire test suite), and the performance increase is significant. The only potential blocker is PEP 509 approval (which I'd be happy to assist with).

What do you think?

Thanks,
Yury


[1] http://bugs.python.org/issue26110
[2] https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10
[3] https://gist.github.com/1st1/37d928f1e84813bf1c44
[4] https://gist.github.com/1st1/10588e6e11c4d7c19445
[5] http://bugs.python.org/issue26219
