On 2016-01-27 3:01 PM, Brett Cannon wrote:
[..]
>> We can also optimize LOAD_METHOD. There is a high chance that 'obj' in
>> 'obj.method()' will be of the same type every time we execute the code
>> object. So if we had an opcode cache, LOAD_METHOD could cache a
>> pointer to the resolved unbound method, a pointer to obj.__class__,
>> and the tp_version_tag of obj.__class__. Then it would only need to
>> check that the cached object type is the same (and that it wasn't
>> modified) and that obj.__dict__ doesn't override 'method'. Long story
>> short, this caching really speeds up method calls on types implemented
>> in C. list.append becomes very fast, because list doesn't have a
>> __dict__, so the check is very cheap (with the cache).
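To make the guard structure concrete, here is a rough Python model of the checks described above. All names here (MethodCacheEntry, lookup_method, version_tag_of) are invented for illustration; the real cache would live in C inside the ceval loop, and version_tag_of is only a crude stand-in for tp_version_tag:

```python
# Rough Python model of the LOAD_METHOD guards described above.  All
# names here are invented for illustration; the real cache would live
# in C inside the ceval loop.

class MethodCacheEntry:
    """What a per-opcode cache slot would hold."""
    def __init__(self, klass, version_tag, method):
        self.klass = klass              # cached obj.__class__
        self.version_tag = version_tag  # cached tp_version_tag stand-in
        self.method = method            # resolved unbound method

def version_tag_of(klass):
    # Crude stand-in for tp_version_tag (real CPython bumps a counter
    # whenever a type or one of its bases is mutated).
    return len(klass.__dict__)

def lookup_method(obj, name, entry):
    klass = type(obj)
    instance_dict = getattr(obj, '__dict__', {})  # empty for e.g. list
    # Fast path: same class, class unchanged, no instance-level shadow.
    if (entry is not None
            and entry.klass is klass
            and entry.version_tag == version_tag_of(klass)
            and name not in instance_dict):
        return entry.method, entry
    if name in instance_dict:
        # Shadowed on the instance: not cacheable as an unbound method.
        return instance_dict[name], entry
    # Slow path: full lookup, then (re)fill the cache slot.
    method = getattr(klass, name)
    return method, MethodCacheEntry(klass, version_tag_of(klass), method)
```

On a hit the interpreter would pay only a couple of word comparisons plus one membership test; for a type like list, whose instances have no __dict__, that last test runs against an empty dict, which is why list.append benefits so much.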
> What would it take to make this work with Python-defined classes?
It already works for Python-defined classes. But it's a bit more
expensive because you still have to check the object's __dict__.
Still, there is a very noticeable performance increase (see the
results of the benchmark runs).
> I guess that would require knowing the version of the instance's
> __dict__, the instance's __class__ version, the MRO, and where the
> method object was found in the MRO and any intermediary classes to
> know if it was suddenly shadowed? I think that's everything. :)
No, unfortunately we can't use the version of the instance's __dict__,
as it is very volatile. The current implementation of the opcode cache
works because types are much more stable. Remember, the cache is per
*code object*, so it persists across all executions of that code object.

    class F:
        def spam(self):
            self.ham()  # <- the version of self.__dict__ is unstable,
                        #    so we'd end up invalidating the cache
                        #    too often

__class__ version, MRO changes, etc. are covered by tp_version_tag,
which I use as one of the guards.
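To see why the instance __dict__ version makes such a poor guard, here is a toy simulation. CPython's real per-dict version counter (ma_version_tag from PEP 509) is not visible from Python code, so VersionedDict below just mimics it:

```python
# Toy simulation of why a per-instance __dict__ version is a poor cache
# key: ordinary attribute writes would invalidate it constantly.
# (CPython's real ma_version_tag is internal; VersionedDict mimics it.)

class VersionedDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0

    def __setitem__(self, key, value):
        self.version += 1               # every store bumps the version
        super().__setitem__(key, value)

d = VersionedDict()                     # stands in for self.__dict__
before = d.version
d['x'] = 1                              # ordinary attribute writes...
d['y'] = 2
d['x'] = 3
assert d.version == before + 3          # ...bump the version every time
```

A guard built on that counter would miss after almost every attribute assignment, whereas tp_version_tag only changes when someone mutates the class itself.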
> Obviously that's a lot, but I wonder how many classes have a deep
> inheritance model vs. inheriting only from `object`? In that case you
> only have to check self.__dict__.ma_version, self.__class__,
> self.__class__.__dict__.ma_version, and self.__class__.__class__ ==
> `type`. I guess another way to look at this is to get an idea of how
> complex the checks have to get before caching something like this is
> not worth it (probably also depends on how often you mutate
> self.__dict__ by mutating attributes, but in that case you could just
> decide to always look at self.__dict__ for the method's key and then
> do the ma_version cache check for everything coming from the class).
>
> Otherwise we can consider looking at the caching strategies that Self
> helped pioneer (http://bibliography.selflanguage.org/), which all of
> the various JS engines lifted, and consider caching all method lookups.
Yeah, hidden classes are great. But the infrastructure to support them
properly is huge. I think that to make them work you'll need a JIT --
to trace, deoptimize, optimize, and do it all with a reasonable memory
footprint. My patch is much smaller and simpler, something we can
realistically tune and ship in 3.6.
>> A straightforward way to implement such a cache is simple, but it
>> consumes a lot of memory that would just be wasted, since we only
>> need such a cache for the LOAD_GLOBAL and LOAD_METHOD opcodes. So we
>> have to be creative about the cache design. Here's what I came up
>> with:
>>
>> 1. We add a few fields to the code object.
>>
>> 2. ceval will count how many times each code object is executed.
>>
>> 3. When the code object is executed over ~900 times, we mark it as
>> "hot".
> What happens if you simply consider all code as hot? Is the overhead
> of building the mapping such that you really need this, or is this
> simply to avoid some memory/startup cost?
That's the first step for this patch. I think we need to profile
several big applications (I'll do it later for some of my code bases)
and see how big the memory impact is if we optimize everything. In any
case, I expect it to be noticeable (which may be acceptable), so we'll
probably try to optimize it.
>> We also create an 'unsigned char' array "MAPPING", with its length
>> set to match the length of the code object. So we have a 1-to-1
>> mapping between opcodes and the MAPPING array.
>>
>> 4. For the next ~100 calls, while the code object is "hot",
>> LOAD_GLOBAL and LOAD_METHOD do "MAPPING[opcode_offset()]++".
>> 5. After 1024 calls to the code object, the ceval loop will iterate
>> through the MAPPING, counting all opcodes that were executed more
>> than 50 times.
> Where did the "50 times" boundary come from? Was this measured somehow
> or did you just guess at a number?
If the number is too low, then you'll optimize code in branches that are
rarely executed. So I picked 50, because I only trace opcodes for 100
calls.
All of those numbers can be (should be?) changed, and I think we should
experiment with different heuristics.
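Putting those numbers together, the warm-up scheme in steps 2-5 might look roughly like this in Python. The class and method names are invented for illustration, the thresholds (900 / 100 calls / 50 executions) come from the message, and the real counters would be C fields on the code object rather than a Python object:

```python
# Sketch of the warm-up heuristic from steps 2-5.  Thresholds are taken
# from the message; names are invented, and the real counters would be
# C fields on the code object.

HOT_AFTER = 900      # calls before a code object is considered "hot"
TRACE_CALLS = 100    # calls during which opcode hits are recorded
OPTIMIZE_OVER = 50   # opcodes run more often than this get a cache

class CodeStats:
    def __init__(self, n_opcodes):
        self.calls = 0
        self.mapping = [0] * n_opcodes   # the 'unsigned char' MAPPING

    def on_call(self):
        self.calls += 1

    @property
    def hot(self):
        # "Hot" only during the ~100-call tracing window.
        return HOT_AFTER < self.calls <= HOT_AFTER + TRACE_CALLS

    def on_cached_opcode(self, offset):
        # Done by LOAD_GLOBAL / LOAD_METHOD while the code is hot.
        if self.hot:
            self.mapping[offset] += 1

    def opcodes_to_optimize(self):
        # Around call ~1024: pick the offsets worth attaching a cache to.
        return [i for i, n in enumerate(self.mapping)
                if n > OPTIMIZE_OVER]
```

An opcode that only runs in a rarely taken branch will typically score well below 50 inside the 100-call window and so is skipped, which is exactly the point of the threshold.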
[..]
>> If you're interested in these kinds of optimizations, please help
>> with code reviews, ideas, profiling and benchmarks. The latter is
>> especially important; I'd never have imagined how hard it is to come
>> up with a good macro benchmark.
> Have you tried http://hg.python.org/benchmarks?
Yes: https://gist.github.com/1st1/aed69d63a2ff4de4c7be
> Or are you looking for new benchmarks? If the latter then we should
> probably strike up a discussion on speed@ and start considering a new,
> unified benchmark suite that CPython, PyPy, Pyston, Jython, and
> IronPython can all agree on.
Yes, IMHO we need better benchmarks. Some of the existing ones are very
unstable -- I can run them three times and get three completely
different results. Benchmarking is hard :)
I'll create a few issues on bugs.python.org with new/updated benchmarks,
and will join the speed@ mailing list.
Yury
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev