On 2016-02-02 4:28 AM, Victor Stinner wrote:
[..]
I took a first look at your patch and, sorry,

Thanks for the initial code review!

I'm skeptical about the
design. I have to play with it a little bit more to check whether there
is a better design.

So far I see two things you are worried about:


1. The cache is attached to the code object vs function/frame.

I think the code object is the perfect place for such a cache.

The cache must be there (and survive!) "across" the frames.
If you attach it to the function object, you'll have to
re-attach it to a frame object on each PyEval call.
I can't see how that would be better.


2. Two levels of indirection in my cache -- offsets table +
cache table.

In my other email thread "Opcode cache in ceval loop" I
explained that optimizing every code object in the standard
library and unittests adds 5% memory overhead.  Optimizing
only those that are called frequently adds less than 1%.

Besides, many functions that you import are never called, or
only called once or twice.  And code objects for modules
and class bodies are run only once.

If we don't use an offset table and just allocate a cache
entry for every opcode, then the memory usage will rise
*significantly*.  Right now the overhead of the offset table
is *8 bits* per opcode, and the overhead of the cache table is
*32 bytes* per optimized opcode.  The overhead of the one
extra level of indirection is minimal.
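
For illustration, here's a rough Python sketch of that two-level layout
(the names are made up; the real thing is a pair of C arrays hanging off
the code object):

    # Illustrative sketch only; not the actual C structures.
    class OpcodeCache:
        def __init__(self, n_opcodes, optimized_offsets):
            # Offset table: one small slot (a single byte in C) per opcode,
            # holding an index into the cache table, or 0 for "no entry".
            self.offsets = [0] * n_opcodes
            # Cache table: one 32-byte entry (a dict placeholder here) only
            # for the opcodes that actually get optimized.
            self.entries = []
            for off in optimized_offsets:
                self.entries.append({})
                self.offsets[off] = len(self.entries)  # 1-based; 0 = none

        def entry_for(self, opcode_offset):
            idx = self.offsets[opcode_offset]
            return self.entries[idx - 1] if idx else None

So every instruction pays only for the tiny per-opcode index, while the
32-byte entries exist only for the instructions that are actually
optimized.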

[..]




2016-01-27 19:25 GMT+01:00 Yury Selivanov <yselivanov...@gmail.com>:
tl;dr The summary is that I have a patch that improves CPython performance
up to 5-10% on macro benchmarks.  Benchmark results on MacBook Pro/Mac OS
X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
slowdowns that I could reproduce consistently.
That's really impressive, great job Yury :-) Getting a non-negligible
speedup on large macrobenchmarks has become really hard in CPython.
CPython is already well optimized in all corners. It looks like the
overall Python performance still depends heavily on the performance of
dictionary and attribute lookups. Even though that was well known, I didn't
expect up to 10% speedup on *macro* benchmarks.

Thanks!



LOAD_METHOD & CALL_METHOD
-------------------------

Victor and I had a lot of conversations about his PEP 509, and he sent me
a link to his amazing compilation of notes about CPython performance [2].
One optimization that he pointed out to me was the LOAD_METHOD/CALL_METHOD
opcodes, an idea that originated in PyPy.

There is a patch that implements this optimization; it's tracked here: [3].
There are some low-level details that I explained in the issue, but I'll go
over the high-level design in this email as well.
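
To make the target concrete, here is the kind of call site these opcodes
are aimed at (plain Python for illustration; the actual change is in the
compiler and the ceval loop):

    class Greeter:
        def greet(self, name):
            return "Hello, " + name

    g = Greeter()

    # Today g.greet("world") compiles to LOAD_ATTR + CALL_FUNCTION:
    # LOAD_ATTR creates a temporary bound-method object just so that
    # CALL_FUNCTION can call it and immediately discard it.  With
    # LOAD_METHOD/CALL_METHOD the common case skips that temporary
    # object and calls the underlying function directly.
    for _ in range(1000):
        g.greet("world")
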
Your cache is stored directly in code objects. Currently, code objects
are immutable.

Code objects are immutable on the Python level.  My cache
doesn't make any previously immutable field mutable.

Adding a few mutable cache structures visible only at the
C level is acceptable I think.


Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to
functions with an "alias" in each frame object:
http://bugs.python.org/issue10401

Andrea Griffini's patch, which also adds a cache for LOAD_GLOBAL, attaches
the cache to code objects.
https://bugs.python.org/issue1616125

Those patches are nice, but optimizing just LOAD_GLOBAL
won't give you a big speed-up.  For instance, 2to3 became
7-8% faster once I started to optimize LOAD_ATTR.

The idea of my patch is that it implements caching
in such a way that we can add it to several different
opcodes.
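
As a rough illustration of the shape of that caching (a conceptual Python
sketch; the real fast path is C code in ceval.c, and "globals_version"
stands in for the dict version tag of PEP 509, which is not a Python-level
API):

    def load_global_cached(cache_entry, globals_dict, globals_version,
                           builtins_dict, name):
        # Fast path: the globals dict hasn't changed since we cached.
        if cache_entry.get("version") == globals_version:
            return cache_entry["value"]
        # Slow path: normal LOAD_GLOBAL lookup, then refresh the entry.
        try:
            value = globals_dict[name]
        except KeyError:
            value = builtins_dict[name]
        cache_entry["version"] = globals_version
        cache_entry["value"] = value
        return value
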
The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.  Let's look at the
first one, which loads the 'print' function from builtins.  The opcode knows
the following bits of information:
I tested your latest patch. It looks like LOAD_GLOBAL never
invalidates the cache on cache miss ("deoptimize" the instruction).

Yes, that was a deliberate decision (but we can add the
deoptimization easily).  So far I haven't seen a use case
or benchmark where we really need to deoptimize.


I suggest always invalidating the cache at each cache miss. Not only
is it common to modify global variables, but there is also the issue of
a different namespace being used with the same code object. Examples:

* late global initialization. See for example the _a85chars cache of
base64.a85encode (a small sketch of this pattern follows the list).
* a code object created in a temporary namespace and then always run in
a different global namespace. See for example
collections.namedtuple(). I'm not sure that it's the best example
because it looks like the Python code only loads builtins, not
globals. But it looks like your code keeps a copy of the version of
the global namespace dict.
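
The late-initialization pattern from the first example looks roughly like
this (simplified from base64.py, not the exact stdlib code):

    _a85chars = None

    def a85encode(data):
        global _a85chars
        if _a85chars is None:          # late global initialization
            _a85chars = [bytes((i,)) for i in range(33, 118)]
        return len(_a85chars), data    # placeholder body, for illustration

A LOAD_GLOBAL cache entry populated while _a85chars is still None has to
notice the later reassignment (e.g. via the dict version tag) or be
invalidated, which is the situation described above.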

I tested with a threshold of 1: always optimize all code objects.
Maybe with your default threshold of 1024 runs, the issue with
different namespaces doesn't occur in practice.

Yep. I added a constant in ceval.c that enables collection
of opcode cache stats.

99.9% of all global dicts in benchmarks are stable.

The test suite was a bit different, only ~99% :)  The one percent
of cache misses was probably because of unittest.mock.



A straightforward implementation of such a cache is simple, but it consumes a
lot of memory that would just be wasted, since we only need such a cache
for the LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about
the cache design.
I'm not sure that it's worth developing complex dynamic logic to
only enable optimizations after a threshold (a design very close to a
JIT compiler).

I think it's not even remotely close to what JITs do.

In my design I have a simple counter -- when it reaches
1000, we create the caches in the code objects, and some
opcodes start to use them.  That's basically it.
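
In Python-ish pseudocode it's roughly the following (illustrative only;
the real counter lives in C, and these attribute names are made up):

    OPTIMIZE_THRESHOLD = 1000

    class CodeState:
        """Stand-in for the extra C-level fields on a code object."""
        def __init__(self):
            self.run_count = 0
            self.opcode_cache = None

    def on_code_executed(state):
        # Bump the per-code-object counter on each run; once it's hot
        # enough, allocate the opcode caches so that LOAD_GLOBAL,
        # LOAD_ATTR and LOAD_METHOD can start using them.
        state.run_count += 1
        if (state.opcode_cache is None
                and state.run_count >= OPTIMIZE_THRESHOLD):
            state.opcode_cache = {}  # stands in for the offset/cache tables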

JIT compilers trace the code, collect information
about types, think about memory, optimize, deoptimize,
think about memory again, etc, etc :)


What is the overhead (% of RSS memory) on a concrete
application when all code objects are optimized at startup?

I've mentioned that in my other thread.

When the whole test suite was run with *every* code object
being optimized (threshold = 1), about 73000 code objects
were optimized, requiring >20 MB of memory (the test suite
process consumed ~400 MB of memory in total).  So ~5% looks to be
the worst case.

When I ran the test suite with threshold set to 1024, only
2000 objects were optimized, requiring less than 1% of
the total process memory.


Maybe we need a global boolean flag to disable the optimization? Or
even a compilation option?

I'd hate to add such a thing.  Why would you want to disable
the cache?  To save 1% of memory?  TBH I think this only
adds maintenance overhead for us.



I mean that all these new counters have a cost, and the code may be
even faster without these counters if everything is always optimized,
no?

Yes, but only marginally.  You'll save one "inc" in the eval
loop, and a couple of "if"s.  Maybe on a micro-benchmark
you can see a difference.

But optimizing everything will require much more memory,
and we shouldn't optimize code objects that are run only
once -- that is, code objects for modules and class bodies.

A threshold of 1024 is big enough to say that the code
object is frequently used and will probably continue
to be frequently used in the future.


I'm not sure that the storage for the cache is really efficient. It's
a compact data structure, but it looks "expensive" to access (there
is one level of indirection). I understand that it's compact to reduce
the memory footprint overhead.

I'm not sure that the threshold of 1000 runs is ok for short scripts.
It would be nice to also optimize scripts which only call a function
900 times :-) Classical memory vs. CPU compromise issue :-)

I'd be OK with changing the threshold to 500 or something, but
IMHO it won't change much.  Short/small scripts won't hit
it anyway, and even if they do, they typically don't run
long enough to get a measurable speedup.



I'm just thinking aloud :-)

Thanks!  I'm happy that you are looking at this thing
with a critical eye.


BTW, here's the debug output from running the unit tests with every
code object optimized:

-- Opcode cache number of objects  = 72395
-- Opcode cache total extra mem    = 20925595

-- Opcode cache LOAD_METHOD hits   = 64569036 (63%)
-- Opcode cache LOAD_METHOD misses = 23899 (0%)
-- Opcode cache LOAD_METHOD opts   = 104872
-- Opcode cache LOAD_METHOD deopts = 19191
-- Opcode cache LOAD_METHOD dct-chk= 12805608
-- Opcode cache LOAD_METHOD total  = 101735114

-- Opcode cache LOAD_GLOBAL hits   = 123808815 (99%)
-- Opcode cache LOAD_GLOBAL misses = 310397 (0%)
-- Opcode cache LOAD_GLOBAL opts   = 125205

-- Opcode cache LOAD_ATTR hits     = 59089435 (53%)
-- Opcode cache LOAD_ATTR misses   = 33372 (0%)
-- Opcode cache LOAD_ATTR opts     = 73643
-- Opcode cache LOAD_ATTR deopts   = 20276
-- Opcode cache LOAD_ATTR total    = 111049468


Yury