On 2016-01-27 3:01 PM, Brett Cannon wrote:


[..]


    We can also optimize LOAD_METHOD.  There is a high chance that 'obj'
    in 'obj.method()' will be of the same type every time we execute the
    code object.  So if we had an opcode cache, LOAD_METHOD could cache a
    pointer to the resolved unbound method, a pointer to obj.__class__,
    and the tp_version_tag of obj.__class__.  Then it would only need to
    check that the cached object type is the same (and that it wasn't
    modified) and that obj.__dict__ doesn't override 'method'.  Long
    story short, this caching really speeds up method calls on types
    implemented in C.  list.append becomes very fast, because list
    doesn't have a __dict__, so the check is very cheap (with the cache).


What would it take to make this work with Python-defined classes?

It already works for Python-defined classes, but it's a bit more expensive because you still have to check the object's __dict__. Still, there is a very noticeable performance increase (see the benchmark results).
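
Roughly, the fast path looks like this -- a Python-like sketch; the real code is C in ceval, and type_version_tag() below is just a stand-in for reading tp_version_tag, which isn't accessible from Python:

def type_version_tag(tp):
    # Stand-in for the C-level tp_version_tag; it can't be read from
    # Python code, so this sketch just returns a constant.
    return 0

class MethodCacheEntry:
    cached_type = None      # obj.__class__ at the time of caching
    cached_tag = None       # that type's version tag at the time of caching
    cached_method = None    # the resolved function / unbound method

def cached_load_method(obj, name, entry):
    tp = type(obj)
    if (tp is entry.cached_type
            and type_version_tag(tp) == entry.cached_tag       # class/MRO unchanged
            and name not in getattr(obj, '__dict__', {})):     # not shadowed per-instance
        return entry.cached_method                             # hit: no dict lookups
    # Miss: do a normal lookup and refill the cache.
    method = getattr(tp, name)
    entry.cached_type = tp
    entry.cached_tag = type_version_tag(tp)
    entry.cached_method = method
    return method

For something like list there is no instance __dict__ at all, so the per-instance check disappears and the guard is basically two cheap comparisons.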

I guess that would require knowing the version of the instance's __dict__, the instance's __class__ version, the MRO, and where the method object was found in the MRO and any intermediary classes to know if it was suddenly shadowed? I think that's everything. :)

No, unfortunately we can't use the version of the instance's __dict__, as it is very volatile. The current implementation of the opcode cache works because types are much more stable. Remember, the cache is per *code object*, so it has to hold up across all executions of that code object.

class F:
    def spam(self):
        self.ham()   # <- the version of self.__dict__ is unstable,
                     #    so we'd end up invalidating the cache
                     #    too often

__class__ version, MRO changes, etc. are all covered by tp_version_tag, which I use as one of the guards.
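
In other words, it's class-level mutation, not instance-level mutation, that invalidates the tp_version_tag guard.  Reusing the F above (illustrative):

F.spam = lambda self: None   # assigning to the *class* invalidates F's
                             # tp_version_tag, so cached entries guarding
                             # on it fail and we fall back to a normal
                             # lookup (and re-cache)

f = F()
f.x = 1                      # assigning to the *instance* doesn't touch
                             # tp_version_tag; it's only caught by the
                             # per-instance __dict__ check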


Obviously that's a lot, but I wonder how many classes have a deep inheritance model vs. inheriting only from `object`? In that case you only have to check self.__dict__.ma_version, self.__class__, self.__class__.__dict__.ma_version, and that self.__class__.__class__ is `type`. I guess another way to look at this is to get an idea of how complex the checks have to get before caching something like this stops being worth it (it probably also depends on how often you mutate self.__dict__ by mutating attributes, but in that case you could just decide to always look in self.__dict__ for the method's key and only do the ma_version cache check for everything coming from the class).

Otherwise we can consider looking at the caching strategies that Self helped pioneer (http://bibliography.selflanguage.org/) that all of the various JS engines lifted and consider caching all method lookups.

Yeah, hidden classes are great. But the infrastructure to support them properly is huge. I think that to make them work you'll need a JIT -- to trace, deoptimize, optimize, and do it all with a reasonable memory footprint. My patch is much smaller and simpler, something we can realistically tune and ship in 3.6.


    A straightforward way to implement such a cache is simple, but it
    consumes a lot of memory that would just be wasted, since we only
    need such a cache for the LOAD_GLOBAL and LOAD_METHOD opcodes.  So
    we have to be creative about the cache design.  Here's what I came
    up with:

    1. We add a few fields to the code object.

    2. ceval will count how many times each code object is executed.

    3. When the code object is executed over ~900 times, we mark it as "hot".


What happens if you simply consider all code as hot? Is the overhead of building the mapping such that you really need this, or is this simply to avoid some memory/startup cost?

That's the first step for this patch. I think we need to profile several big applications (I'll do it later for some of my code bases) and see how big the memory impact is if we optimize everything.

In any case, I expect it to be noticeable (which may be acceptable), so we'll probably try to optimize it.

       We also create an 'unsigned char' array "MAPPING", with its
    length set to match the length of the code object, so we have a
    1-to-1 mapping between opcode positions and the MAPPING array.

    4. During the next ~100 calls, while the code object is "hot",
    LOAD_GLOBAL and LOAD_METHOD do "MAPPING[opcode_offset()]++".

    5. After 1024 calls to the code object, the ceval loop will iterate
    through MAPPING, counting all opcodes that were executed more than
    50 times.


Where did the "50 times" boundary come from? Was this measured somehow or did you just guess at a number?

If the number is too low, then you'll optimize code in branches that are rarely executed. So I picked 50, because I only trace opcodes for 100 calls.

All of those numbers can be (should be?) changed, and I think we should experiment with different heuristics.
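
To make that concrete, here's a rough Python model of the heuristic -- the real thing is C code attached to the code object, the names here are only illustrative, and the thresholds mirror the approximate numbers above:

HOT = 900         # calls before a code object is considered "hot"
TRACE = 100       # subsequent calls during which opcode executions are counted
THRESHOLD = 50    # opcodes executed more often than this get a cache entry

class CodeObjectStats:
    def __init__(self, code_len):
        self.ncalls = 0
        self.mapping = bytearray(code_len)   # 1-to-1 with opcode offsets

    def on_call(self):
        self.ncalls += 1
        if self.ncalls == HOT + TRACE:       # roughly the "1024 calls" point
            return self.hot_offsets()        # allocate caches only for these
        return None

    def on_opcode(self, offset):
        # Called only for LOAD_GLOBAL / LOAD_METHOD while the code is "hot".
        if HOT <= self.ncalls < HOT + TRACE and self.mapping[offset] < 255:
            self.mapping[offset] += 1

    def hot_offsets(self):
        return [off for off, n in enumerate(self.mapping) if n > THRESHOLD]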

[..]


    If you're interested in these kinds of optimizations, please help
    with code reviews, ideas, profiling and benchmarks.  The latter is
    especially important -- I'd never have imagined how hard it is to
    come up with a good macro benchmark.


Have you tried http://hg.python.org/benchmarks?

Yes: https://gist.github.com/1st1/aed69d63a2ff4de4c7be

Or are you looking for new benchmarks? If the latter then we should probably strike up a discussion on speed@ and start considering a new, unified benchmark suite that CPython, PyPy, Pyston, Jython, and IronPython can all agree on.

Yes, IMHO we need better benchmarks. Some of the existing ones are very unstable -- I can run them three times and get three completely different results. Benchmarking is hard :)

I'll create a few issues on bugs.python.org with new/updated benchmarks, and will join the speed@ mailing list.

Yury