Hi,

This is the second email thread I've started about implementing an opcode cache in the ceval loop. Since my first post on this topic:

- I've implemented another optimization (LOAD_ATTR);

- I've added a detailed statistics mode so that I can "see" how the cache performs and tune it;

- some macro benchmarks are now 10-20% faster; 2to3 (a real application) is 7-8% faster;

- and I have some good insights on the memory footprint.

** The purpose of this email is to get general approval from python-dev, so that I can start polishing the patches and getting them reviewed and committed. **


Summary of optimizations
------------------------

When a code object is executed more than ~1000 times, it's considered "hot". It gets its opcodes analyzed to initialize caches for LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and LOAD_GLOBAL.

It's important to only optimize code objects that were executed "enough" times, to avoid optimizing code objects for modules, classes, and functions that were imported but never used.
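To make the mechanism concrete, here is a minimal C sketch of a counter-driven check of this kind; the names and the exact bookkeeping are my own illustration, not the patch:

    #define OPCACHE_MIN_RUNS 1000   /* ~1000 executions before we optimize */

    typedef struct {
        unsigned int co_run_count;   /* illustrative: bumped on each execution */
        int co_opt_initialized;      /* illustrative: caches allocated? */
    } HotnessState;

    /* Called when the eval loop starts executing a code object. */
    static int
    maybe_init_opcache(HotnessState *st)
    {
        if (st->co_opt_initialized)
            return 1;                        /* already hot and initialized */
        if (++st->co_run_count < OPCACHE_MIN_RUNS)
            return 0;                        /* not hot yet: run unoptimized */
        /* Hot: scan the bytecode here and allocate caches for
           LOAD_ATTR / LOAD_GLOBAL / LOAD_METHOD. */
        st->co_opt_initialized = 1;
        return 1;
    }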

The cache struct is defined in code.h [2], and is 32 bytes long. When a code object becomes hot, it gets a cache offset table allocated for it (one byte per opcode), plus an array of cache structs.
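As a rough illustration of that layout (the real definition is in code.h [2]; the field names and the exact split of the 32 bytes are my guesses, not the patch):

    #include <stdint.h>

    typedef struct {
        void    *ptr;          /* cached value or descriptor */
        uint64_t version_a;    /* a guard, e.g. type->tp_version_tag */
        uint64_t version_b;    /* a second guard, e.g. a dict version */
        uint64_t reserved;     /* pads the entry to 32 bytes */
    } OpCacheEntry;            /* sizeof == 32 on 64-bit builds */

    typedef struct {
        uint8_t      *offset_table;  /* one byte per opcode; 0 = not cached */
        OpCacheEntry *entries;       /* one entry per optimized instruction */
    } OpCache;

    /* Finding an instruction's cache entry is just two array loads: */
    static OpCacheEntry *
    opcache_get(OpCache *c, int instr_index)
    {
        uint8_t slot = c->offset_table[instr_index];
        return slot ? &c->entries[slot - 1] : NULL;
    }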

To measure the worst-case memory impact, I tuned my code to optimize *every* code object on its *first* run, and then ran the entire Python test suite. The test suite plus the standard library contain around 72395 code objects, which required 20 MB of memory for caches. The test process consumed around 400 MB of memory overall, so in the absolute worst-case scenario the overhead is about 5%.

Then I ran the test suite with the patch unmodified, meaning that only code objects called frequently enough were optimized. In this mode, only 2072 code objects were optimized, using less than 1 MB of memory for caches.


LOAD_ATTR
---------

Damien George mentioned that MicroPython optimizes a lot of dict lookups by memoizing the offset of the last key/value pair found in the dict object, thus eliminating many hash lookups. I've implemented this optimization in my patch (a sketch follows at the end of this section), and the results are quite good: a simple micro-benchmark [3] shows a ~30% speed improvement. Here are some debug stats generated by the 2to3 benchmark:

-- Opcode cache LOAD_ATTR hits     = 14778415 (83%)
-- Opcode cache LOAD_ATTR misses   = 750 (0%)
-- Opcode cache LOAD_ATTR opts     = 282
-- Opcode cache LOAD_ATTR deopts   = 60
-- Opcode cache LOAD_ATTR total    = 17777912

Each "hit" makes LOAD_ATTR about 30% faster.


LOAD_GLOBAL
-----------

This turned out to be a very stable optimization. Here is the debug output of the 2to3 test:

-- Opcode cache LOAD_GLOBAL hits   = 3940647 (100%)
-- Opcode cache LOAD_GLOBAL misses = 0 (0%)
-- Opcode cache LOAD_GLOBAL opts   = 252

All benchmarks (and real code) show stats like these. Globals and builtins are very rarely modified, so the cache works really well. With the LOAD_GLOBAL opcode cache, a global lookup is very cheap: there is no hash lookup for it at all. It makes optimizations like "def foo(len=len)" obsolete.
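For illustration, here is a minimal C sketch of what a LOAD_GLOBAL cache hit could look like, assuming PEP 509-style dict version tags; the names are mine, not the patch's:

    #include <stdint.h>

    typedef struct {
        uint64_t ma_version;   /* bumped on every mutation (PEP 509) */
        /* ... the rest of the dict ... */
    } VersionedDict;

    typedef struct {
        void    *value;        /* cached pointer to the global */
        uint64_t globals_ver;  /* globals dict version at fill time */
        uint64_t builtins_ver; /* builtins dict version at fill time */
    } LoadGlobalCache;

    static void *
    load_global_fast(LoadGlobalCache *c, VersionedDict *globals,
                     VersionedDict *builtins)
    {
        if (c->value != NULL &&
            c->globals_ver == globals->ma_version &&
            c->builtins_ver == builtins->ma_version)
        {
            return c->value;   /* hit: no hash lookup at all */
        }
        return NULL;           /* miss: do a normal lookup and refill */
    }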


LOAD_METHOD
-----------

This is a new opcode I propose to add in [1]. The idea is to replace LOAD_ATTR with it for method calls, avoiding the instantiation of BoundMethod objects.

With the cache, we can store a reference to the method descriptor (I use type->tp_version_tag for cache invalidation, the same mechanism _PyType_Lookup is built around).

The cache makes LOAD_METHOD really efficient. A simple micro-benchmark [4] shows that with the cache and LOAD_METHOD, "s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".

LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than LOAD_ATTR/CALL_FUNCTION. With the cache, it's about 30% faster.
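Here is a minimal C sketch of the cache-hit path as described above, with toy stand-ins for the object structs (names are mine): on a hit, the function and `self` are pushed separately, and no bound method object is ever created.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t tp_version_tag; } ToyType;
    typedef struct { ToyType *ob_type; } ToyObject;

    typedef struct {
        ToyObject *func;      /* cached descriptor (the plain function) */
        uint64_t   type_ver;  /* tp_version_tag when the cache was filled */
    } LoadMethodCache;

    /* Returns the function to call, or NULL on a miss (fall back to
       _PyType_Lookup and refill the cache). */
    static ToyObject *
    load_method_fast(LoadMethodCache *c, ToyObject *self)
    {
        if (c->func != NULL &&
            c->type_ver == self->ob_type->tp_version_tag)
        {
            return c->func;   /* hit: caller pushes func and self */
        }
        return NULL;
    }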

Here's the debug output of the 2to3 benchmark:

-- Opcode cache LOAD_METHOD hits   = 5164848 (64%)
-- Opcode cache LOAD_METHOD misses = 12 (0%)
-- Opcode cache LOAD_METHOD opts   = 94
-- Opcode cache LOAD_METHOD deopts = 12
-- Opcode cache LOAD_METHOD dct-chk= 1614801
-- Opcode cache LOAD_METHOD total  = 7945954


What's next?
------------

First, I'd like to merge the new LOAD_METHOD opcode; see issue 26110 [1]. It's a very straightforward optimization, and the patch is small and easy to review.

Second, I'd like to merge the new opcode cache; see issue 26219 [5]. All unit tests pass. The memory usage increase is very moderate (<1 MB for the entire test suite), and the performance increase is significant. The only potential blocker is PEP 509 approval (which I'd be happy to assist with).

What do you think?

Thanks,
Yury


[1] http://bugs.python.org/issue26110
[2] https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10
[3] https://gist.github.com/1st1/37d928f1e84813bf1c44
[4] https://gist.github.com/1st1/10588e6e11c4d7c19445
[5] http://bugs.python.org/issue26219
