Re: [Python-Dev] Speeding up CPython 5-10%
No problem, I did not think you were attacking me, nor did I find your response rude.

On Wed, May 18, 2016, at 01:06 PM, Cesare Di Mauro wrote:
> If you feel like I've attacked you, I apologize: it wasn't my
> intention. Please don't take it personally: I only reported my honest
> opinion, albeit after a re-read it looks too rude, and I'm sorry
> for that.
>
> Regarding the post-bytecode optimization issues, they are mainly
> represented by the constant folding code, which is still in the
> peephole stage. Once it's moved to the proper place (ASDL/AST), such
> issues with the stack calculations disappear, whereas the remaining
> ones can be addressed by a fix of the current stackdepth_walk
> function.
>
> And just to be clear, I've nothing against your code. I simply think
> that, in my experience, it doesn't fit in CPython.
>
> Regards
> Cesare
>
> 2016-05-18 18:50 GMT+02:00:
>> Your criticisms may very well be true. IIRC though, I wrote that
>> pass because what was available was not general enough. The
>> stackdepth_walk function made assumptions that, while true of code
>> generated by the current CPython frontend, were not universally
>> true. If a goal is to move this calculation after any bytecode
>> optimization, something along these lines seems like it will
>> eventually be necessary.
>>
>> Anyway, just offering things already written. If you don't feel
>> it's useful, no worries.
>>
>> On Wed, May 18, 2016, at 11:35 AM, Cesare Di Mauro wrote:
>>> 2016-05-17 8:25 GMT+02:00:
>>>> In the project https://github.com/zachariahreed/byteasm I
>>>> mentioned on the list earlier this month, I have a pass that
>>>> computes stack usage for a given sequence of bytecodes. It seems
>>>> to be a fair bit more aggressive than CPython's. Maybe it's more
>>>> generally useful. It's pure Python rather than C though.
>>>
>>> IMO it's too big, resource hungry, and slower, even if you convert
>>> it to C.
>>>
>>> If you take a look at the current stackdepth_walk function which
>>> CPython uses, it's much smaller (not even 50 lines of simple C
>>> code) and quite efficient.
>>>
>>> Currently the problem is that it doesn't return the maximum depth
>>> of the tree, but it updates the intermediate/current maximum, and
>>> *then* it uses it for the subsequent calculations. So the depth
>>> artificially grows, like in the reported cases.
>>>
>>> It doesn't require a complete rewrite, just some time spent
>>> fine-tuning it.
>>>
>>> Regards
>>> Cesare
Re: [Python-Dev] Speeding up CPython 5-10%
If you feel like I've attacked you, I apologize: it wasn't my intention. Please don't take it personally: I only reported my honest opinion, albeit after a re-read it looks too rude, and I'm sorry for that.

Regarding the post-bytecode optimization issues, they are mainly represented by the constant folding code, which is still in the peephole stage. Once it's moved to the proper place (ASDL/AST), such issues with the stack calculations disappear, whereas the remaining ones can be addressed by a fix of the current stackdepth_walk function.

And just to be clear, I've nothing against your code. I simply think that, in my experience, it doesn't fit in CPython.

Regards
Cesare

2016-05-18 18:50 GMT+02:00:
> Your criticisms may very well be true. IIRC though, I wrote that pass
> because what was available was not general enough. The
> stackdepth_walk function made assumptions that, while true of code
> generated by the current CPython frontend, were not universally true.
> If a goal is to move this calculation after any bytecode
> optimization, something along these lines seems like it will
> eventually be necessary.
>
> Anyway, just offering things already written. If you don't feel it's
> useful, no worries.
>
> On Wed, May 18, 2016, at 11:35 AM, Cesare Di Mauro wrote:
>> 2016-05-17 8:25 GMT+02:00:
>>> In the project https://github.com/zachariahreed/byteasm I mentioned
>>> on the list earlier this month, I have a pass that computes stack
>>> usage for a given sequence of bytecodes. It seems to be a fair bit
>>> more aggressive than CPython's. Maybe it's more generally useful.
>>> It's pure Python rather than C though.
>>
>> IMO it's too big, resource hungry, and slower, even if you convert
>> it to C.
>>
>> If you take a look at the current stackdepth_walk function which
>> CPython uses, it's much smaller (not even 50 lines of simple C code)
>> and quite efficient.
>>
>> Currently the problem is that it doesn't return the maximum depth of
>> the tree, but it updates the intermediate/current maximum, and
>> *then* it uses it for the subsequent calculations. So the depth
>> artificially grows, like in the reported cases.
>>
>> It doesn't require a complete rewrite, just some time spent
>> fine-tuning it.
>>
>> Regards
>> Cesare
Re: [Python-Dev] Speeding up CPython 5-10%
Your criticisms may very well be true. IIRC though, I wrote that pass because what was available was not general enough. The stackdepth_walk function made assumptions that, while true of code generated by the current CPython frontend, were not universally true. If a goal is to move this calculation after any bytecode optimization, something along these lines seems like it will eventually be necessary.

Anyway, just offering things already written. If you don't feel it's useful, no worries.

On Wed, May 18, 2016, at 11:35 AM, Cesare Di Mauro wrote:
> 2016-05-17 8:25 GMT+02:00:
>> In the project https://github.com/zachariahreed/byteasm I mentioned
>> on the list earlier this month, I have a pass that computes stack
>> usage for a given sequence of bytecodes. It seems to be a fair bit
>> more aggressive than CPython's. Maybe it's more generally useful.
>> It's pure Python rather than C though.
>
> IMO it's too big, resource hungry, and slower, even if you convert it
> to C.
>
> If you take a look at the current stackdepth_walk function which
> CPython uses, it's much smaller (not even 50 lines of simple C code)
> and quite efficient.
>
> Currently the problem is that it doesn't return the maximum depth of
> the tree, but it updates the intermediate/current maximum, and *then*
> it uses it for the subsequent calculations. So the depth artificially
> grows, like in the reported cases.
>
> It doesn't require a complete rewrite, just some time spent
> fine-tuning it.
>
> Regards
> Cesare
Re: [Python-Dev] Speeding up CPython 5-10%
2016-05-17 8:25 GMT+02:00:
> In the project https://github.com/zachariahreed/byteasm I mentioned
> on the list earlier this month, I have a pass that computes stack
> usage for a given sequence of bytecodes. It seems to be a fair bit
> more aggressive than CPython's. Maybe it's more generally useful.
> It's pure Python rather than C though.

IMO it's too big, resource hungry, and slower, even if you convert it to C.

If you take a look at the current stackdepth_walk function which CPython uses, it's much smaller (not even 50 lines of simple C code) and quite efficient.

Currently the problem is that it doesn't return the maximum depth of the tree, but it updates the intermediate/current maximum, and *then* it uses it for the subsequent calculations. So the depth artificially grows, like in the reported cases.

It doesn't require a complete rewrite, just some time spent fine-tuning it.

Regards
Cesare
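For illustration, here is a minimal Python sketch of the fix being described: return the maximum over the successor branches instead of feeding a running maximum forward into later blocks. The toy block format ("peak"/"net"/"succ" dicts) is made up for the example and is not CPython's actual stackdepth_walk:

    def max_depth(block, depth, seen=None):
        # Return the maximum stack depth reachable from `block`,
        # entered with `depth` items already on the stack.
        if seen is None:
            seen = set()
        if id(block) in seen:          # simplification: visit blocks once
            return depth
        seen.add(id(block))
        peak = depth + block["peak"]   # highest point inside this block
        after = depth + block["net"]   # items left on the stack at exit
        for succ in block["succ"]:
            # Key point: each successor starts from `after`; its result
            # only raises the final maximum and never inflates the depth
            # used for subsequent calculations.
            peak = max(peak, max_depth(succ, after, seen))
        return peak

    # Toy example: a block peaking 2 above entry and leaving 1 item,
    # followed by a block peaking 1 above its own entry depth.
    b2 = {"peak": 1, "net": 0, "succ": []}
    b1 = {"peak": 2, "net": 1, "succ": [b2]}
    assert max_depth(b1, 0) == 2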
Re: [Python-Dev] Speeding up CPython 5-10%
In the project https://github.com/zachariahreed/byteasm I mentioned on the list earlier this month, I have a pass that computes stack usage for a given sequence of bytecodes. It seems to be a fair bit more aggressive than CPython's. Maybe it's more generally useful. It's pure Python rather than C though.
Re: [Python-Dev] Speeding up CPython 5-10%
2016-05-16 17:55 GMT+02:00 Meador Inge:
> On Sun, May 15, 2016 at 2:23 AM, Cesare Di Mauro
> <cesare.di.ma...@gmail.com> wrote:
>
>> Just one thing that comes to my mind: is the stack depth calculation
>> routine changed? It was suboptimal, and calculating a better number
>> decreases stack allocation, and increases the frame usage.
>
> This is still a problem and came up again recently:
>
> http://bugs.python.org/issue26549
>
> -- Meador

I saw the last two comments of the issue: this is what I was talking about (in particular the issue opened by Armin applies). However there's another case where the situation is even worse. Let me show a small reproducer:

def test(self):
    for i in range(self.count):
        with self:
            pass

The stack size reported by Python 2.7.11:

>>> test.__code__.co_stacksize
6

Adding another with statement:

>>> test.__code__.co_stacksize
7

But unfortunately with Python 3.5.1 the problem is much worse (the same two variants):

>>> test.__code__.co_stacksize
10
>>> test.__code__.co_stacksize
17

Here the situation is exacerbated by the fact that the WITH_CLEANUP instruction of Python 2.x was split into two (WITH_CLEANUP_START and WITH_CLEANUP_FINISH) in Python 3.5. I don't know why two different instructions were introduced, but IMO it's better to have one instruction which handles all code finalization of the with statement, at least in this case. If there are other scenarios where two different instructions are needed, then ad-hoc instructions like those can be used.

Regards,
Cesare
Re: [Python-Dev] Speeding up CPython 5-10%
On Sun, May 15, 2016 at 2:23 AM, Cesare Di Mauro wrote:
> Just one thing that comes to my mind: is the stack depth calculation
> routine changed? It was suboptimal, and calculating a better number
> decreases stack allocation, and increases the frame usage.

This is still a problem and came up again recently:

http://bugs.python.org/issue26549

-- Meador
Re: [Python-Dev] Speeding up CPython 5-10%
2016-02-01 17:54 GMT+01:00 Yury Selivanov:
> Thanks for bringing this up!
>
> IIRC wpython was about using "fat" bytecodes, i.e. using 64 bits per
> bytecode instead of 8.

No, it used 16, 32, and 48 bits per opcode (1, 2, or 3 16-bit words).

> That allows to minimize the number of bytecodes, thus having some
> performance increase. TBH, I don't think it was "significantly
> faster".

Please, take a look at the benchmarks, or compile it and check yourself. ;-)

> If I were to do some big refactoring of the ceval loop, I'd probably
> consider implementing a register VM. While register VMs are a bit
> faster than stack VMs (up to 20-30%), they would also allow us to
> apply more optimizations, and even bolt on a simple JIT compiler.
>
> Yury

WPython was a hybrid VM: it supported both a stack-based and a register-based approach. I think that's needed, given the nature of Python, because you can have operations with intermixed operands: constants, locals, globals, names. It's quite difficult to handle all possible cases with a purely register-based VM.

Regards,
Cesare
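As a toy sketch of the wordcode idea: pack an opcode and a small argument into a single 16-bit word, so most instructions are one word and larger arguments spill into following words (which is how a 1/2/3-word encoding comes about). This encoding is made up for the example and is not WPython's actual format:

    def pack(opcode, arg):
        # One 16-bit word: low byte = opcode, high byte = small argument.
        assert 0 <= opcode < 256 and 0 <= arg < 256
        return (arg << 8) | opcode

    def unpack(word):
        return word & 0xFF, word >> 8

    word = pack(0x74, 3)            # hypothetical opcode 0x74, argument 3
    assert unpack(word) == (0x74, 3)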
Re: [Python-Dev] Speeding up CPython 5-10%
2016-02-02 10:28 GMT+01:00 Victor Stinner:
> 2016-01-27 19:25 GMT+01:00 Yury Selivanov:
>> tl;dr The summary is that I have a patch that improves CPython
>> performance up to 5-10% on macro benchmarks. Benchmarks results on
>> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
>> available at [1]. There are no slowdowns that I could reproduce
>> consistently.
>
> That's really impressive, great job Yury :-) Getting non-negligible
> speedup on large macrobenchmarks became really hard in CPython.
> CPython is already well optimized in all corners.

It's a long time since I took a look at CPython (3.2), but if it hasn't changed a lot, then there might be some corner cases still waiting to be optimized. ;-)

Just one thing that comes to my mind: is the stack depth calculation routine changed? It was suboptimal, and calculating a better number decreases stack allocation, and increases the frame usage.

> It looks like the overall Python performance still depends heavily on
> the performance of dictionary and attribute lookups. Even if it was
> well known, I didn't expect up to 10% speedup on *macro* benchmarks.

True, but it might be mitigated in some ways, at least for built-in types. There are ideas about that, but they are a bit complicated to implement. The problem is with functions like len, which IMO should become attribute lookups ('foo'.len) or method executions ('foo'.len()). Then it'll be easier to accelerate their execution with one of the above ideas. However, such changes belong to Guido, who defines the language structure/philosophy.

IMO something like len should be part of the attributes exposed by an object: it's more "object-oriented". Whereas other things like open, file, sum, etc., are "general facilities".

Regards,
Cesare
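As context for the 'foo'.len idea: the method form already exists under the hood, since the len() builtin dispatches to the type's __len__ slot. A quick demonstration:

    class Foo:
        def __len__(self):
            return 42

    f = Foo()
    print(len(f))               # 42 -- the builtin delegates to the type
    print(type(f).__len__(f))   # 42 -- roughly the lookup len() performs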
Re: [Python-Dev] Speeding up CPython 5-10%
On 3 February 2016 at 03:52, Brett Cannon wrote:
> Fifth, if we manage to show that a C API can easily be added to
> CPython to make a JIT something that can simply be plugged in and be
> useful, then we will also have a basic JIT framework for people to
> use. As I said, our use of CoreCLR is just for ease of development.
> There is no reason we couldn't use ChakraCore, v8, LLVM, etc. But
> since all of these JIT compilers would need to know how to handle
> CPython bytecode, we have tried to design a framework where JIT
> compilers just need a wrapper to handle code emission, and the
> framework that we are building will handle driving the code emission
> (e.g., the wrapper needs to know how to emit add_integer(), but our
> framework handles when to do that).

That could also be really interesting in the context of pymetabiosis [1] if it meant that PyPy could still at least partially JIT the Python code running on the CPython side of the boundary.

Cheers,
Nick.

[1] https://github.com/rguillebert/pymetabiosis

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Speeding up CPython 5-10%
Also, modern compiler technology tends to use "infinite register" machines for the intermediate representation, then uses register coloring to assign the actual registers (and generate spill code if needed). I've seen work on inter-function optimization for avoiding some register loads and stores (combined with tail-call optimization, it can turn recursive calls into loops in the register machine).

On 2 February 2016 at 09:16, Sven R. Kunze wrote:
> On 02.02.2016 00:27, Greg Ewing wrote:
>> Sven R. Kunze wrote:
>>> Are there some resources on why register machines are considered
>>> faster than stack machines?
>>
>> If a register VM is faster, it's probably because each register
>> instruction does the work of about 2-3 stack instructions, meaning
>> fewer trips around the eval loop, so fewer unpredictable branches
>> and fewer pipeline flushes.
>
> That's what I found so far as well.
>
>> This assumes that bytecode dispatching is a substantial fraction of
>> the time taken to execute each instruction. For something like
>> cpython, where the operations carried out by the bytecodes involve a
>> substantial amount of work, this may not be true.
>
> Interesting point indeed. It makes sense that register machines only
> save us the bytecode dispatching.
>
> How much that is compared to the work each instruction requires, I
> cannot say. Maybe Yury has a better understanding here.
>
>> It also assumes the VM is executing the bytecodes directly. If there
>> is a JIT involved, it all gets translated into something else
>> anyway, and then it's more a matter of whether you find it easier to
>> design the JIT to deal with stack or register code.
>
> It seems like Yury thinks so. He didn't tell us so far.
>
> Best,
> Sven
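To make the 2-3x instruction-count ratio concrete, here is a quick comparison anyone can run; the register form shown in the comments is hypothetical, since CPython has no such instruction set:

    import dis

    def f(a, b, c):
        a = b + c

    dis.dis(f)
    # On CPython 3.5 the assignment costs four dispatches on the stack VM:
    #     LOAD_FAST b; LOAD_FAST c; BINARY_ADD; STORE_FAST a
    # A register VM could express the same statement as one three-address
    # instruction, e.g. ADD a, b, c -- a single trip around the eval loop.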
Re: [Python-Dev] Speeding up CPython 5-10%
On Tue, 2 Feb 2016 at 01:29 Victor Stinner wrote:
> Hi,
>
> I'm back from the FOSDEM event at Bruxelles, it was really cool. I
> gave a talk about FAT Python and I got good feedback. But friends
> told me that people now have expectations on FAT Python. It looks
> like people care about Python performance :-)
>
> FYI the slides of my talk:
> https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf
> (a video was recorded, I don't know when it will be online)
>
> I took a first look at your patch and, sorry, I'm skeptical about the
> design. I have to play with it a little bit more to check if there is
> no better design.
>
> To be clear, FAT Python with your work looks more and more like a
> cheap JIT compiler :-) Guards, specializations, optimizing at runtime
> after a threshold... all these things come from JIT compilers. I like
> the idea of a kind-of JIT compiler without having to pay the high
> cost of a large dependency like LLVM. I like baby steps in CPython:
> it's faster, and it's possible to implement in a single release cycle
> (one minor Python release, Python 3.6). Integrating a JIT compiler
> into CPython already failed with Unladen Swallow :-/
>
> PyPy has a completely different design (and has serious issues with
> the Python C API), Pyston is restricted to Python 2.7, Pyjion looks
> specific to Windows (CoreCLR), Numba is specific to numeric
> computations (numpy). IMHO none of these projects can easily be
> merged into CPython "quickly" (again, in a single Python release
> cycle). By the way, Pyjion still looks very young (I heard that they
> are still working on the compatibility with CPython, not on
> performance yet).

We are not ready to have a serious discussion about Pyjion yet as we are still working on compatibility (we have a talk proposal in for PyCon US 2016 and so we are hoping to have something to discuss at the language summit), but Victor's email shows there are some misconceptions about it already and a misunderstanding of our fundamental goal.

First off, Pyjion is very much a work in progress. You can find it at https://github.com/microsoft/pyjion (where there is an FAQ), but for this audience the key thing to know is that we are still working on compatibility (see https://github.com/Microsoft/Pyjion/blob/master/Tests/python_tests.txt for the list of tests we do (not) pass from the Python test suite). Out of our roughly 400 tests, we don't pass about 18 of them.

Second, we have not really started work on performance yet. We have done some very low-hanging fruit stuff, but just barely. IOW we are not really ready to discuss performance (ATM we JIT instantly for all code objects, and even being that aggressive with the JIT overhead we are even with/slightly slower than an unmodified Python 3.5 VM, so we are hopeful this work will pan out).

Third, the over-arching goal of Pyjion is not to add a JIT into CPython, but to add a C API to CPython that will allow plugging in a JIT. If you simply JIT code objects then the API required to let someone plug in a JIT is basically three functions, maybe as little as two (you can see the exact patch against CPython that we are working with at https://github.com/Microsoft/Pyjion/blob/master/Patches/python.diff). We have no interest in shipping a JIT with CPython, just making it much easier to let others add one if they want to because it makes sense for their workload.
(And if Yury's caching stuff goes in with an execution counter, then even the one bit of true overhead we had will be part of CPython already, which makes it even more of an easy decision to consider the API we will eventually propose.)

Fourth, it is not Windows-only by design. CoreCLR is cross-platform on all major OSs, so that is not a restriction (and honestly we are using CoreCLR simply because Dino used to work on the CLR team, so he knows the bytecode really well; we easily could have used some other JIT to prove our point). The only reason Pyjion doesn't work on other OSs is momentum/laziness on Dino's and my part: Dino hacked together Pyjion at PyCon US 2015, and since he is most comfortable on Windows he just did it there in Visual Studio, without bothering to start with e.g. CMake to make it build on other OSs. Since we are still trying to work out some compatibility stuff, we would rather do that than worry about Linux or OS X support right now.

Fifth, if we manage to show that a C API can easily be added to CPython to make a JIT something that can simply be plugged in and be useful, then we will also have a basic JIT framework for people to use. As I said, our use of CoreCLR is just for ease of development. There is no reason we couldn't use ChakraCore, v8, LLVM, etc. But since all of these JIT compilers would need to know how to handle CPython bytecode, we have tried to design a framework where JIT compilers just need a wrapper to handle code emission, and the framework that we are building will handle driving the code emission (e.g., the wrapper needs to know how to emit add_integer(), but our framework handles when to do that).
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-02-02 4:28 AM, Victor Stinner wrote:
[..]
> I took a first look at your patch and, sorry,

Thanks for the initial code review!

> I'm skeptical about the design. I have to play with it a little bit
> more to check if there is no better design.

So far I see two things you are worried about:

1. The cache is attached to the code object vs function/frame.

I think the code object is the perfect place for such a cache. The cache must be there (and survive!) "across" the frames. If you attach it to the function object, you'll have to re-attach it to a frame object on each PyEval call. I can't see how that would be better.

2. Two levels of indirection in my cache -- offsets table + cache table.

In my other email thread "Opcode cache in ceval loop" I explained that optimizing every code object in the standard library and unittests adds 5% memory overhead. Optimizing only those that are called frequently is less than 1%. Besides, many functions that you import are never called, or only called once or twice. And code objects for modules and class bodies are called once.

If we don't use an offset table and just allocate a cache entry for every opcode, then the memory usage will rise *significantly*. Right now the overhead of the offset table is *8 bits* per opcode; the overhead of the cache table is *32 bytes* per optimized opcode. The overhead of using 1 extra indirection is minimal.

[..]

> 2016-01-27 19:25 GMT+01:00 Yury Selivanov:
>> tl;dr The summary is that I have a patch that improves CPython
>> performance up to 5-10% on macro benchmarks. Benchmarks results on
>> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
>> available at [1]. There are no slowdowns that I could reproduce
>> consistently.
>
> That's really impressive, great job Yury :-) Getting non-negligible
> speedup on large macrobenchmarks became really hard in CPython.
> CPython is already well optimized in all corners. It looks like the
> overall Python performance still depends heavily on the performance
> of dictionary and attribute lookups. Even if it was well known, I
> didn't expect up to 10% speedup on *macro* benchmarks.

Thanks!

>> LOAD_METHOD & CALL_METHOD
>> -
>>
>> We had a lot of conversations with Victor about his PEP 509, and he
>> sent me a link to his amazing compilation of notes about CPython
>> performance [2]. One optimization that he pointed out to me was
>> LOAD/CALL_METHOD opcodes, an idea first originated in PyPy.
>>
>> There is a patch that implements this optimization, it's tracked
>> here: [3]. There are some low level details that I explained in the
>> issue, but I'll go over the high level design in this email as well.
>
> Your cache is stored directly in code objects. Currently, code
> objects are immutable.

Code objects are immutable on the Python level. My cache doesn't make any previously immutable field mutable. Adding a few mutable cache structures visible only at the C level is acceptable I think.

> Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to
> functions with an "alias" in each frame object:
> http://bugs.python.org/issue10401
>
> Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a
> cache for code objects too.
> https://bugs.python.org/issue1616125

Those patches are nice, but optimizing just LOAD_GLOBAL won't give you a big speed-up. For instance, 2to3 became 7-8% faster once I started to optimize LOAD_ATTR. The idea of my patch is that it implements caching in such a way that we can add it to several different opcodes.

>> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.
>> Let's look at the first one, that loads the 'print' function from
>> builtins. The opcode knows the following bits of information:
>
> I tested your latest patch. It looks like LOAD_GLOBAL never
> invalidates the cache on cache miss ("deoptimize" the instruction).

Yes, that was a deliberate decision (but we can add the deoptimization easily). So far I haven't seen a use case or benchmark where we really need to deoptimize.

> I suggest to always invalidate the cache at each cache miss. Not only
> is it common to modify global variables, but there is also the issue
> of different namespaces used with the same code object. Examples:
>
> * late global initialization. See for example the _a85chars cache of
>   base64.a85encode.
> * code object created in a temporary namespace and then always run in
>   a different global namespace. See for example
>   collections.namedtuple(). I'm not sure that it's the best example
>   because it looks like the Python code only loads builtins, not
>   globals. But it looks like your code keeps a copy of the version of
>   the global namespace dict.
>
> I tested with a threshold of 1: always optimize all code objects.
> Maybe with your default threshold of 1024 runs, the issue with
> different namespaces doesn't occur in practice.

Yep. I added a constant in ceval.c that enables collection of opcode cache stats. 99.9% of all global dicts in benchmarks are stable. The test suite was a
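To illustrate the two-level layout discussed above, here is a rough Python model of it; the real implementation is in C, and the names used here (offsets, slots, version) are made up for the sketch:

    class OpcodeCache:
        def __init__(self, n_instructions):
            # Level 1: one byte per instruction; 0 means "no cache slot".
            self.offsets = bytearray(n_instructions)
            # Level 2: slots only for optimized instructions (0 reserved).
            self.slots = [None]

        def add_slot(self, instr_index):
            if len(self.slots) > 255:     # 8-bit offsets: ~255 slots max
                return
            self.slots.append({"version": None, "value": None})
            self.offsets[instr_index] = len(self.slots) - 1

        def load_global(self, instr_index, namespace, version, key):
            slot_index = self.offsets[instr_index]
            if slot_index:
                slot = self.slots[slot_index]
                if slot["version"] == version:   # dict unchanged: hit
                    return slot["value"]
                slot["version"] = version        # miss: refill from dict
                slot["value"] = namespace[key]
                return slot["value"]
            return namespace[key]                # instruction not optimized

    cache = OpcodeCache(4)
    cache.add_slot(0)
    ns, version = {"print": print}, 1            # bump version on mutation
    assert cache.load_global(0, ns, version, "print") is print   # fill
    assert cache.load_global(0, ns, version, "print") is print   # hit

Here `version` stands in for a dict version tag (in the spirit of PEP 509): only optimized instructions pay the 32-byte slot cost, everything else pays one byte.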
Re: [Python-Dev] Speeding up CPython 5-10%
Hi,

I'm back from the FOSDEM event at Bruxelles, it was really cool. I gave a talk about FAT Python and I got good feedback. But friends told me that people now have expectations on FAT Python. It looks like people care about Python performance :-)

FYI the slides of my talk:
https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf
(a video was recorded, I don't know when it will be online)

I took a first look at your patch and, sorry, I'm skeptical about the design. I have to play with it a little bit more to check if there is no better design.

To be clear, FAT Python with your work looks more and more like a cheap JIT compiler :-) Guards, specializations, optimizing at runtime after a threshold... all these things come from JIT compilers. I like the idea of a kind-of JIT compiler without having to pay the high cost of a large dependency like LLVM. I like baby steps in CPython: it's faster, and it's possible to implement in a single release cycle (one minor Python release, Python 3.6). Integrating a JIT compiler into CPython already failed with Unladen Swallow :-/

PyPy has a completely different design (and has serious issues with the Python C API), Pyston is restricted to Python 2.7, Pyjion looks specific to Windows (CoreCLR), Numba is specific to numeric computations (numpy). IMHO none of these projects can easily be merged into CPython "quickly" (again, in a single Python release cycle). By the way, Pyjion still looks very young (I heard that they are still working on the compatibility with CPython, not on performance yet).

2016-01-27 19:25 GMT+01:00 Yury Selivanov:
> tl;dr The summary is that I have a patch that improves CPython
> performance up to 5-10% on macro benchmarks. Benchmarks results on
> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
> available at [1]. There are no slowdowns that I could reproduce
> consistently.

That's really impressive, great job Yury :-) Getting non-negligible speedup on large macrobenchmarks became really hard in CPython. CPython is already well optimized in all corners.

> LOAD_METHOD & CALL_METHOD
> -
>
> We had a lot of conversations with Victor about his PEP 509, and he
> sent me a link to his amazing compilation of notes about CPython
> performance [2]. One optimization that he pointed out to me was
> LOAD/CALL_METHOD opcodes, an idea first originated in PyPy.
>
> There is a patch that implements this optimization, it's tracked
> here: [3]. There are some low level details that I explained in the
> issue, but I'll go over the high level design in this email as well.

Your cache is stored directly in code objects. Currently, code objects are immutable.

Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to functions with an "alias" in each frame object:
http://bugs.python.org/issue10401

Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a cache for code objects too.
https://bugs.python.org/issue1616125

I don't know what is the best place to store the cache. I vaguely recall a patch which uses a single unique global cache, but maybe I'm wrong :-p

> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3. Let's look
> at the first one, that loads the 'print' function from builtins. The
> opcode knows the following bits of information:

I tested your latest patch.
It looks like LOAD_GLOBAL never invalidates the cache on cache miss ("deoptimize" the instruction). I suggest to always invalidate the cache at each cache miss. Not only is it common to modify global variables, but there is also the issue of different namespaces used with the same code object. Examples:

* late global initialization. See for example the _a85chars cache of
  base64.a85encode.
* code object created in a temporary namespace and then always run in a
  different global namespace. See for example collections.namedtuple().
  I'm not sure that it's the best example because it looks like the
  Python code only loads builtins, not globals. But it looks like your
  code keeps a copy of the version of the global namespace dict.

I tested with a threshold of 1: always optimize all code objects. Maybe with your default threshold of 1024 runs, the issue with different namespaces doesn't occur in practice.

> A straightforward way to implement such a cache is simple, but
> consumes a lot of memory, that would be just wasted, since we only
> need such a cache for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have
> to be creative about the cache design.

I'm not sure that it's worth it to develop complex dynamic logic to only enable optimizations after a threshold (a design very close to a JIT compiler). What is the overhead (% of RSS memory) on a concrete application when all code objects are optimized at
Re: [Python-Dev] Speeding up CPython 5-10%
On 02.02.2016 00:27, Greg Ewing wrote:
> Sven R. Kunze wrote:
>> Are there some resources on why register machines are considered
>> faster than stack machines?
>
> If a register VM is faster, it's probably because each register
> instruction does the work of about 2-3 stack instructions, meaning
> fewer trips around the eval loop, so fewer unpredictable branches and
> fewer pipeline flushes.

That's what I found so far as well.

> This assumes that bytecode dispatching is a substantial fraction of
> the time taken to execute each instruction. For something like
> cpython, where the operations carried out by the bytecodes involve a
> substantial amount of work, this may not be true.

Interesting point indeed. It makes sense that register machines only save us the bytecode dispatching.

How much that is compared to the work each instruction requires, I cannot say. Maybe Yury has a better understanding here.

> It also assumes the VM is executing the bytecodes directly. If there
> is a JIT involved, it all gets translated into something else anyway,
> and then it's more a matter of whether you find it easier to design
> the JIT to deal with stack or register code.

It seems like Yury thinks so. He didn't tell us so far.

Best,
Sven
Re: [Python-Dev] Speeding up CPython 5-10%
On 01.02.2016 19:28, Brett Cannon wrote:
> A search for [stack vs register based virtual machine] will get you
> some information.

Alright. :) Will go for that.

> You aren't really supposed to yet. :) In Pyjion's case we are still
> working on compatibility, let alone trying to show a speed
> improvement, so we have not said much beyond this mailing list (we
> have a talk proposal in for PyCon US that we hope gets accepted). We
> just happened to get picked up on Reddit and HN recently and so
> interest has spiked in the project.

Exciting. :)

>> So, it could be that we will see a jitted CPython when Pyjion
>> appears to be successful?
>
> The ability to plug in a JIT, but yes, that's the hope.

Okay. Not sure what you mean by a plugin. One thing I like about Python is that it just works. So, a plugin sounds like unnecessary work.
Re: [Python-Dev] Speeding up CPython 5-10%
On Mon, 1 Feb 2016 at 09:08 Yury Selivanov wrote:
> On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
>> On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
>>> Hi,
>>>
>>> tl;dr The summary is that I have a patch that improves CPython
>>> performance up to 5-10% on macro benchmarks. Benchmarks results on
>>> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
>>> available at [1]. There are no slowdowns that I could reproduce
>>> consistently.
>>
>> Have you looked at Cesare Di Mauro's wpython? As far as I know, it's
>> now unmaintained, and the project repo on Google Code appears to be
>> dead (I get a 404), but I understand that it was significantly
>> faster than CPython back in the 2.6 days.
>>
>> https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf
>
> Thanks for bringing this up!
>
> IIRC wpython was about using "fat" bytecodes, i.e. using 64 bits per
> bytecode instead of 8. That allows to minimize the number of
> bytecodes, thus having some performance increase. TBH, I don't think
> it was "significantly faster".
>
> If I were to do some big refactoring of the ceval loop, I'd probably
> consider implementing a register VM. While register VMs are a bit
> faster than stack VMs (up to 20-30%), they would also allow us to
> apply more optimizations, and even bolt on a simple JIT compiler.

If you did tackle the register VM approach, that would also settle a long-standing question of whether a certain optimization works for Python.

As for bolting on a JIT, the whole point of Pyjion is to see if that's worth it for CPython, so that's already being taken care of (and is actually easier with a stack-based VM since the JIT engine we're using is stack-based itself).
Re: [Python-Dev] Speeding up CPython 5-10%
On 01.02.2016 18:18, Brett Cannon wrote:
> On Mon, 1 Feb 2016 at 09:08 Yury Selivanov wrote:
>> On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
>>> On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
>>>> Hi,
>>>>
>>>> tl;dr The summary is that I have a patch that improves CPython
>>>> performance up to 5-10% on macro benchmarks. Benchmarks results on
>>>> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
>>>> available at [1]. There are no slowdowns that I could reproduce
>>>> consistently.
>>>
>>> Have you looked at Cesare Di Mauro's wpython? As far as I know,
>>> it's now unmaintained, and the project repo on Google Code appears
>>> to be dead (I get a 404), but I understand that it was
>>> significantly faster than CPython back in the 2.6 days.
>>>
>>> https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf
>>
>> Thanks for bringing this up!
>>
>> IIRC wpython was about using "fat" bytecodes, i.e. using 64 bits per
>> bytecode instead of 8. That allows to minimize the number of
>> bytecodes, thus having some performance increase. TBH, I don't think
>> it was "significantly faster".
>>
>> If I were to do some big refactoring of the ceval loop, I'd probably
>> consider implementing a register VM. While register VMs are a bit
>> faster than stack VMs (up to 20-30%), they would also allow us to
>> apply more optimizations, and even bolt on a simple JIT compiler.
>
> If you did tackle the register VM approach, that would also settle a
> long-standing question of whether a certain optimization works for
> Python.

Are there some resources on why register machines are considered faster than stack machines?

> As for bolting on a JIT, the whole point of Pyjion is to see if
> that's worth it for CPython, so that's already being taken care of
> (and is actually easier with a stack-based VM since the JIT engine
> we're using is stack-based itself).

Interesting. Haven't noticed these projects yet.

So, it could be that we will see a jitted CPython when Pyjion appears to be successful?

Best,
Sven
Re: [Python-Dev] Speeding up CPython 5-10%
On Mon, 1 Feb 2016 at 10:21 Sven R. Kunze wrote:
> On 01.02.2016 18:18, Brett Cannon wrote:
>> On Mon, 1 Feb 2016 at 09:08 Yury Selivanov wrote:
>>> On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
>>>> On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
>>>>> Hi,
>>>>>
>>>>> tl;dr The summary is that I have a patch that improves CPython
>>>>> performance up to 5-10% on macro benchmarks. Benchmarks results
>>>>> on Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
>>>>> available at [1]. There are no slowdowns that I could reproduce
>>>>> consistently.
>>>>
>>>> Have you looked at Cesare Di Mauro's wpython? As far as I know,
>>>> it's now unmaintained, and the project repo on Google Code appears
>>>> to be dead (I get a 404), but I understand that it was
>>>> significantly faster than CPython back in the 2.6 days.
>>>>
>>>> https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf
>>>
>>> Thanks for bringing this up!
>>>
>>> IIRC wpython was about using "fat" bytecodes, i.e. using 64 bits
>>> per bytecode instead of 8. That allows to minimize the number of
>>> bytecodes, thus having some performance increase. TBH, I don't
>>> think it was "significantly faster".
>>>
>>> If I were to do some big refactoring of the ceval loop, I'd
>>> probably consider implementing a register VM. While register VMs
>>> are a bit faster than stack VMs (up to 20-30%), they would also
>>> allow us to apply more optimizations, and even bolt on a simple JIT
>>> compiler.
>>
>> If you did tackle the register VM approach, that would also settle a
>> long-standing question of whether a certain optimization works for
>> Python.
>
> Are there some resources on why register machines are considered
> faster than stack machines?

A search for [stack vs register based virtual machine] will get you some information.

>> As for bolting on a JIT, the whole point of Pyjion is to see if
>> that's worth it for CPython, so that's already being taken care of
>> (and is actually easier with a stack-based VM since the JIT engine
>> we're using is stack-based itself).
>
> Interesting. Haven't noticed these projects, yet.

You aren't really supposed to yet. :) In Pyjion's case we are still working on compatibility, let alone trying to show a speed improvement, so we have not said much beyond this mailing list (we have a talk proposal in for PyCon US that we hope gets accepted). We just happened to get picked up on Reddit and HN recently and so interest has spiked in the project.

> So, it could be that we will see a jitted CPython when Pyjion appears
> to be successful?

The ability to plug in a JIT, but yes, that's the hope.
Re: [Python-Dev] Speeding up CPython 5-10%
On 01/02/2016 16:54, Yury Selivanov wrote:
> On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
>> On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
>>> Hi,
>>>
>>> tl;dr The summary is that I have a patch that improves CPython
>>> performance up to 5-10% on macro benchmarks. Benchmarks results on
>>> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
>>> available at [1]. There are no slowdowns that I could reproduce
>>> consistently.
>>
>> Have you looked at Cesare Di Mauro's wpython? As far as I know, it's
>> now unmaintained, and the project repo on Google Code appears to be
>> dead (I get a 404), but I understand that it was significantly
>> faster than CPython back in the 2.6 days.
>>
>> https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf
>
> Thanks for bringing this up!
>
> IIRC wpython was about using "fat" bytecodes, i.e. using 64 bits per
> bytecode instead of 8. That allows to minimize the number of
> bytecodes, thus having some performance increase. TBH, I don't think
> it was "significantly faster".

From https://code.google.com/archive/p/wpython/

"WPython is a re-implementation of (some parts of) Python, which drops support for bytecode in favour of a wordcode-based model (where a word is 16 bits wide). It also implements a hybrid stack-register virtual machine, and adds a lot of other optimizations."

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
Re: [Python-Dev] Speeding up CPython 5-10%
Sven R. Kunze wrote:
> Are there some resources on why register machines are considered
> faster than stack machines?

If a register VM is faster, it's probably because each register instruction does the work of about 2-3 stack instructions, meaning fewer trips around the eval loop, so fewer unpredictable branches and fewer pipeline flushes.

This assumes that bytecode dispatching is a substantial fraction of the time taken to execute each instruction. For something like cpython, where the operations carried out by the bytecodes involve a substantial amount of work, this may not be true.

It also assumes the VM is executing the bytecodes directly. If there is a JIT involved, it all gets translated into something else anyway, and then it's more a matter of whether you find it easier to design the JIT to deal with stack or register code.

--
Greg
Re: [Python-Dev] Speeding up CPython 5-10%
Hi Brett,

On 2016-02-01 12:18 PM, Brett Cannon wrote:
> On Mon, 1 Feb 2016 at 09:08 Yury Selivanov wrote:
[..]
>> If I were to do some big refactoring of the ceval loop, I'd probably
>> consider implementing a register VM. While register VMs are a bit
>> faster than stack VMs (up to 20-30%), they would also allow us to
>> apply more optimizations, and even bolt on a simple JIT compiler.
[..]
> As for bolting on a JIT, the whole point of Pyjion is to see if
> that's worth it for CPython, so that's already being taken care of
> (and is actually easier with a stack-based VM since the JIT engine
> we're using is stack-based itself).

Sure, I have very high hopes for Pyjion and Pyston. I really hope that Microsoft and Dropbox will keep pushing.

Yury
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
> On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
>> Hi,
>>
>> tl;dr The summary is that I have a patch that improves CPython
>> performance up to 5-10% on macro benchmarks. Benchmarks results on
>> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
>> available at [1]. There are no slowdowns that I could reproduce
>> consistently.
>
> Have you looked at Cesare Di Mauro's wpython? As far as I know, it's
> now unmaintained, and the project repo on Google Code appears to be
> dead (I get a 404), but I understand that it was significantly faster
> than CPython back in the 2.6 days.
>
> https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf

Thanks for bringing this up!

IIRC wpython was about using "fat" bytecodes, i.e. using 64 bits per bytecode instead of 8. That allows to minimize the number of bytecodes, thus having some performance increase. TBH, I don't think it was "significantly faster".

If I were to do some big refactoring of the ceval loop, I'd probably consider implementing a register VM. While register VMs are a bit faster than stack VMs (up to 20-30%), they would also allow us to apply more optimizations, and even bolt on a simple JIT compiler.

Yury
Re: [Python-Dev] Speeding up CPython 5-10%
On 01.02.2016 17:54, Yury Selivanov wrote:
> If I were to do some big refactoring of the ceval loop, I'd probably
> consider implementing a register VM. While register VMs are a bit
> faster than stack VMs (up to 20-30%), they would also allow us to
> apply more optimizations, and even bolt on a simple JIT compiler.

How do a JIT and a register machine relate to each other? :)

Best,
Sven
Re: [Python-Dev] Speeding up CPython 5-10%
Hi Yury,

> An off-topic: have you ever tried hg.python.org/benchmarks or
> compared MicroPython vs CPython? I'm curious if MicroPython is
> faster -- in that case we'll try to copy some optimization ideas.

I've tried a small number of those benchmarks, but not in any rigorous way, and not enough to compare properly with CPython. Maybe one day I (or someone) will get to it and report results :)

One thing that makes MP fast is the use of pointer tagging and stuffing of small integers within object pointers. Thus integer arithmetic below 2**30 (on a 32-bit arch) requires no heap.

> Do you use opcode dictionary caching only for LOAD_GLOBAL-like
> opcodes? Do you have an equivalent of LOAD_FAST, or you use dicts to
> store local variables?

The opcodes that have dict caching are:

LOAD_NAME
LOAD_GLOBAL
LOAD_ATTR
STORE_ATTR
LOAD_METHOD (not implemented yet in mainline repo)

For local variables we use LOAD_FAST and STORE_FAST (and DELETE_FAST). Actually, there are 16 dedicated opcodes for loading from positions 0-15, and 16 for storing to these positions. Eg:

LOAD_FAST_0
LOAD_FAST_1
...

Mostly this is done to save RAM, since LOAD_FAST_0 is 1 byte.

> If we change the opcode size, it will probably affect libraries that
> compose or modify code objects. Modules like "dis" will also need to
> be updated. And that's probably just the tip of the iceberg.
>
> We can still implement your approach if we add a separate private
> 'unsigned char' array to each code object, so that LOAD_GLOBAL can
> store the key offsets. It should be a bit faster than my current
> patch, since it has one less level of indirection. But this way we
> lose the ability to optimize LOAD_METHOD, simply because it requires
> more memory for its cache. In any case, I'll experiment!

The problem with that approach (having a separate array for offset_guess) is: how do you know where to look into that array for a given LOAD_GLOBAL opcode? The second LOAD_GLOBAL in your bytecode should look into the second entry in the array, but how does it know?

I'd love to experiment implementing my original caching idea with CPython, but no time!

Cheers,
Damien.
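A toy model of the tagging trick, using Python integers to stand in for machine words (the actual MicroPython encoding may differ; this is just the general idea): word-aligned heap pointers always have a zero low bit, so that bit can tag a word as a small integer instead of a pointer.

    def box_small_int(n):
        # Fit the value in 31 bits (32-bit word minus the tag bit).
        assert -2**30 <= n < 2**30
        return (n << 1) | 1            # low bit 1 = "this word is an int"

    def is_small_int(word):
        return word & 1 == 1

    def unbox_small_int(word):
        return word >> 1               # arithmetic shift restores the value

    w = box_small_int(21)
    assert is_small_int(w) and unbox_small_int(w) == 21

    # Adding two tagged ints needs no heap allocation at all:
    a, b = box_small_int(2), box_small_int(3)
    total = box_small_int(unbox_small_int(a) + unbox_small_int(b))
    assert unbox_small_int(total) == 5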
Re: [Python-Dev] Speeding up CPython 5-10%
Yury Selivanov schrieb am 27.01.2016 um 19:25:
> tl;dr The summary is that I have a patch that improves CPython
> performance up to 5-10% on macro benchmarks. Benchmarks results on
> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
> available at [1]. There are no slowdowns that I could reproduce
> consistently.
>
> There are two different optimizations that yield this speedup:
> LOAD_METHOD/CALL_METHOD opcodes and a per-opcode cache in the ceval
> loop.
>
> LOAD_METHOD & CALL_METHOD
> -
>
> We had a lot of conversations with Victor about his PEP 509, and he
> sent me a link to his amazing compilation of notes about CPython
> performance [2]. One optimization that he pointed out to me was
> LOAD/CALL_METHOD opcodes, an idea first originated in PyPy.
>
> There is a patch that implements this optimization, it's tracked
> here: [3]. There are some low level details that I explained in the
> issue, but I'll go over the high level design in this email as well.
>
> Every time you access a method attribute on an object, a BoundMethod
> object is created. It is a fairly expensive operation, despite a
> freelist of BoundMethods (so that memory allocation is generally
> avoided). The idea is to detect what looks like a method call in the
> compiler, and emit a pair of specialized bytecodes for that.
>
> So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
> LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.
>
> LOAD_METHOD looks at the object on top of the stack, and checks if
> the name resolves to a method or to a regular attribute. If it's a
> method, then we push the unbound method object and the object to the
> stack. If it's an attribute, we push the resolved attribute and NULL.
>
> When CALL_METHOD looks at the stack it knows how to call the unbound
> method properly (pushing the object as a first arg), or how to call a
> regular callable.
>
> This idea does make CPython faster around 2-4%. And it surely doesn't
> make it slower. I think it's a safe bet to at least implement this
> optimization in CPython 3.6.
>
> So far, the patch only optimizes positional-only method calls. It's
> possible to optimize all kinds of calls, but this will necessitate 3
> more opcodes (explained in the issue). We'll need to do some careful
> benchmarking to see if it's really needed.

I implemented a similar but simpler optimisation in Cython a while back:

http://blog.behnel.de/posts/faster-python-calls-in-cython-021.html

Instead of avoiding the creation of method objects, as you proposed, it just normally calls getattr and, if that returns a bound method object, uses inlined calling code that avoids re-packing the argument tuple. Interestingly, I got speedups of 5-15% for some of the Python benchmarks, but I don't quite remember which ones (at least raytrace and richards, I think), nor do I recall the overall gain, which (I assume) is what you are referring to with your 2-4% above. Might have been in the same order.

Stefan
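A rough pure-Python rendering of that trick (the real thing is generated C, and in pure Python this of course saves nothing -- the sketch only models the control flow; fast_method_call is a made-up name):

    import types

    def fast_method_call(obj, name, *args):
        attr = getattr(obj, name)
        if isinstance(attr, types.MethodType):
            # Bound Python-level method: call the underlying function
            # directly with self prepended, skipping the method object's
            # own call machinery.
            return attr.__func__(attr.__self__, *args)
        return attr(*args)             # anything else: call it normally

    class Greeter:
        def greet(self, who):
            return "hello " + who

    assert fast_method_call(Greeter(), "greet", "world") == "hello world"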
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-29 5:00 AM, Stefan Behnel wrote:
> Yury Selivanov schrieb am 27.01.2016 um 19:25:
[..]
>> LOAD_METHOD looks at the object on top of the stack, and checks if
>> the name resolves to a method or to a regular attribute. If it's a
>> method, then we push the unbound method object and the object to the
>> stack. If it's an attribute, we push the resolved attribute and
>> NULL.
>>
>> When CALL_METHOD looks at the stack it knows how to call the unbound
>> method properly (pushing the object as a first arg), or how to call
>> a regular callable.
>>
>> This idea does make CPython faster around 2-4%. And it surely
>> doesn't make it slower. I think it's a safe bet to at least
>> implement this optimization in CPython 3.6.
>>
>> So far, the patch only optimizes positional-only method calls. It's
>> possible to optimize all kinds of calls, but this will necessitate 3
>> more opcodes (explained in the issue). We'll need to do some careful
>> benchmarking to see if it's really needed.
>
> I implemented a similar but simpler optimisation in Cython a while
> back:
>
> http://blog.behnel.de/posts/faster-python-calls-in-cython-021.html
>
> Instead of avoiding the creation of method objects, as you proposed,
> it just normally calls getattr and, if that returns a bound method
> object, uses inlined calling code that avoids re-packing the argument
> tuple. Interestingly, I got speedups of 5-15% for some of the Python
> benchmarks, but I don't quite remember which ones (at least raytrace
> and richards, I think), nor do I recall the overall gain, which (I
> assume) is what you are referring to with your 2-4% above. Might have
> been in the same order.

That's great! I'm still working on the patch, but so far it looks like adding just LOAD_METHOD/CALL_METHOD (which avoid instantiating BoundMethods) gives us 10-15% faster method calls. Combining them with my opcode cache makes them 30-35% faster.

Yury
Re: [Python-Dev] Speeding up CPython 5-10%
Hi Damien,

BTW I just saw (and backed!) your new Kickstarter campaign to port MicroPython to the ESP8266, good stuff!

On 2016-01-29 7:38 AM, Damien George wrote:
> Hi Yury,
[..]
>> Do you use opcode dictionary caching only for LOAD_GLOBAL-like
>> opcodes? Do you have an equivalent of LOAD_FAST, or you use dicts to
>> store local variables?
>
> The opcodes that have dict caching are:
>
> LOAD_NAME
> LOAD_GLOBAL
> LOAD_ATTR
> STORE_ATTR
> LOAD_METHOD (not implemented yet in mainline repo)
>
> For local variables we use LOAD_FAST and STORE_FAST (and
> DELETE_FAST). Actually, there are 16 dedicated opcodes for loading
> from positions 0-15, and 16 for storing to these positions. Eg:
>
> LOAD_FAST_0
> LOAD_FAST_1
> ...
>
> Mostly this is done to save RAM, since LOAD_FAST_0 is 1 byte.

Interesting. This might actually make CPython slightly faster too. Worth trying.

>> If we change the opcode size, it will probably affect libraries that
>> compose or modify code objects. Modules like "dis" will also need to
>> be updated. And that's probably just the tip of the iceberg.
>>
>> We can still implement your approach if we add a separate private
>> 'unsigned char' array to each code object, so that LOAD_GLOBAL can
>> store the key offsets. It should be a bit faster than my current
>> patch, since it has one less level of indirection. But this way we
>> lose the ability to optimize LOAD_METHOD, simply because it requires
>> more memory for its cache. In any case, I'll experiment!
>
> The problem with that approach (having a separate array for
> offset_guess) is: how do you know where to look into that array for a
> given LOAD_GLOBAL opcode? The second LOAD_GLOBAL in your bytecode
> should look into the second entry in the array, but how does it know?

I've changed my approach a little bit. Now I have a simple function [1] to initialize the cache for code objects that are called frequently enough. It walks through the code object's opcodes and creates the appropriate offset/cache tables. Then, in the ceval loop, I have a couple of convenient macros to work with the cache [2]. They use the INSTR_OFFSET() macro to locate the cache entry via the offset table initialized by [1].

Thanks,
Yury

[1] https://github.com/1st1/cpython/blob/opcache4/Objects/codeobject.c#L167
[2] https://github.com/1st1/cpython/blob/opcache4/Python/ceval.c#L1164
Re: [Python-Dev] Speeding up CPython 5-10%
On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
> Hi,
>
> tl;dr The summary is that I have a patch that improves CPython
> performance by up to 5-10% on macro benchmarks. Benchmark results on
> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
> available at [1]. There are no slowdowns that I could reproduce
> consistently.

Have you looked at Cesare Di Mauro's wpython? As far as I know, it's now unmaintained, and the project repo on Google Code appears to be dead (I get a 404), but I understand that it was significantly faster than CPython back in the 2.6 days.

https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-based%20Python.pdf

-- Steve
Re: [Python-Dev] Speeding up CPython 5-10%
Hi Yuri,

I think these are great ideas to speed up CPython. They are probably the simplest yet most effective ways to get performance improvements in the VM.

MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired by PyPy, and the main reason to have it is because you don't need to allocate on the heap when doing a simple method call). The specific opcodes are:

  LOAD_METHOD         # same behaviour as you propose
  CALL_METHOD         # for calls with positional and/or keyword args
  CALL_METHOD_VAR_KW  # for calls with one or both of */**

We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for non-method calls.

MicroPython also has dictionary lookup caching, but it's a bit different to your proposal. We do something much simpler: each opcode that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR, etc) includes a single byte in the opcode which is an offset-guess into the dictionary to find the desired element. Eg for LOAD_GLOBAL we have (pseudo code):

  CASE(LOAD_GLOBAL):
    key = DECODE_KEY;
    offset_guess = DECODE_BYTE;
    if (global_dict[offset_guess].key == key) {
      // found the element straight away
    } else {
      // not found, do a full lookup and save the offset
      offset_guess = dict_lookup(global_dict, key);
      UPDATE_BYTECODE(offset_guess);
    }
    PUSH(global_dict[offset_guess].elem);

We have found that such caching gives a massive performance increase, on the order of 20%. The issue (for us) is that it increases bytecode size by a considerable amount, requires writeable bytecode, and can be non-deterministic in terms of lookup time. Those things are important in the embedded world, but not so much on the desktop.

Good luck with it!

Regards,
Damien.
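Damien's pseudo code translates almost directly into a runnable toy model. In the sketch below, 'slots' stands in for the dict's internal key/value table and the function name is invented for illustration; it shows why a hit costs a single comparison while a miss falls back to a full lookup and patches the guess:

  def load_global(slots, key, guess):
      """Return (value, new_guess); the caller writes new_guess back
      into the bytecode, as UPDATE_BYTECODE does."""
      k, v = slots[guess]
      if k == key:                            # hit: one comparison
          return v, guess
      for i, (k, v) in enumerate(slots):      # miss: full lookup
          if k == key:
              return v, i
      raise NameError(key)

  slots = [('print', print), ('len', len)]
  val, guess = load_global(slots, 'len', 0)       # stale guess: miss
  assert val is len and guess == 1
  val, guess = load_global(slots, 'len', guess)   # now a hit
  assert val is len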
Re: [Python-Dev] Speeding up CPython 5-10%
BTW, this optimization also makes some old optimization tricks obsolete.

1. No need to write 'def func(len=len)'. Globals lookups will be fast.

2. No need to save bound methods:

  obj = []
  obj_append = obj.append
  for _ in range(10**6):
      obj_append(something)

This hand-optimized code would only be marginally faster, because of LOAD_METHOD and how it's cached.

Yury
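A quick, if crude, way to check this claim on any given interpreter build is a pair of timeit runs along these lines (illustrative only; the actual numbers depend on whether the build carries the patch):

  import timeit

  setup = "obj = []"
  plain = "for _ in range(1000): obj.append(None)"
  cached = "ap = obj.append\nfor _ in range(1000): ap(None)"

  # Without LOAD_METHOD/CALL_METHOD the second variant wins noticeably;
  # with the patch and the opcode cache the gap should mostly disappear.
  print("obj.append :", timeit.timeit(plain, setup, number=1000))
  print("ap = append:", timeit.timeit(cached, setup, number=1000))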
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-27 3:46 PM, Glenn Linderman wrote:
> On 1/27/2016 12:37 PM, Yury Selivanov wrote:
>>> MicroPython also has dictionary lookup caching, but it's a bit
>>> different to your proposal. We do something much simpler: each
>>> opcode that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL,
>>> LOAD_ATTR, etc) includes a single byte in the opcode which is an
>>> offset-guess into the dictionary to find the desired element. Eg
>>> for LOAD_GLOBAL we have (pseudo code):
>>>
>>>   CASE(LOAD_GLOBAL):
>>>     key = DECODE_KEY;
>>>     offset_guess = DECODE_BYTE;
>>>     if (global_dict[offset_guess].key == key) {
>>>       // found the element straight away
>>>     } else {
>>>       // not found, do a full lookup and save the offset
>>>       offset_guess = dict_lookup(global_dict, key);
>>>       UPDATE_BYTECODE(offset_guess);
>>>     }
>>>     PUSH(global_dict[offset_guess].elem);
>>>
>>> We have found that such caching gives a massive performance
>>> increase, on the order of 20%. The issue (for us) is that it
>>> increases bytecode size by a considerable amount, requires writeable
>>> bytecode, and can be non-deterministic in terms of lookup time.
>>> Those things are important in the embedded world, but not so much on
>>> the desktop.
>>
>> That's a neat idea! You're right, it does require bytecode to become
>> writeable.
>
> Would it? Remember "fixup lists"? Maybe they still exist for loading
> function addresses from one DLL into the code of another at load time?
>
> So the equivalent for bytecode requires a static table of
> offset_guess, and the offsets into that table are allocated by the
> byte-code loader at byte-code load time, and the byte-code is "fixed
> up" at load time to use the correct offsets into the offset_guess
> table. It takes one more indirection to find the guess, but if the
> result is a 20% improvement, maybe you'd still get 19%...

Right, in my current patch I have an offset table per code object. Essentially, this offset table adds 8 bits per opcode. It also means that only the first 255 LOAD_GLOBAL/LOAD_METHOD opcodes *per code object* are optimized (because the offset table can only store 8-bit offsets), which is usually enough (I think you'd need a function longer than ~500 lines of code to reach that limit).

Yury
Re: [Python-Dev] Speeding up CPython 5-10%
On 1/27/2016 12:37 PM, Yury Selivanov wrote:
>> MicroPython also has dictionary lookup caching, but it's a bit
>> different to your proposal. We do something much simpler: each opcode
>> that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
>> etc) includes a single byte in the opcode which is an offset-guess
>> into the dictionary to find the desired element. Eg for LOAD_GLOBAL
>> we have (pseudo code):
>>
>>   CASE(LOAD_GLOBAL):
>>     key = DECODE_KEY;
>>     offset_guess = DECODE_BYTE;
>>     if (global_dict[offset_guess].key == key) {
>>       // found the element straight away
>>     } else {
>>       // not found, do a full lookup and save the offset
>>       offset_guess = dict_lookup(global_dict, key);
>>       UPDATE_BYTECODE(offset_guess);
>>     }
>>     PUSH(global_dict[offset_guess].elem);
>>
>> We have found that such caching gives a massive performance increase,
>> on the order of 20%. The issue (for us) is that it increases bytecode
>> size by a considerable amount, requires writeable bytecode, and can
>> be non-deterministic in terms of lookup time. Those things are
>> important in the embedded world, but not so much on the desktop.
>
> That's a neat idea! You're right, it does require bytecode to become
> writeable.

Would it? Remember "fixup lists"? Maybe they still exist for loading function addresses from one DLL into the code of another at load time?

So the equivalent for bytecode requires a static table of offset_guess, and the offsets into that table are allocated by the byte-code loader at byte-code load time, and the byte-code is "fixed up" at load time to use the correct offsets into the offset_guess table. It takes one more indirection to find the guess, but if the result is a 20% improvement, maybe you'd still get 19%...
Re: [Python-Dev] Speeding up CPython 5-10%
On Wed, 27 Jan 2016 at 10:26 Yury Selivanov wrote:
> Hi,
>
> tl;dr The summary is that I have a patch that improves CPython
> performance by up to 5-10% on macro benchmarks. Benchmark results on
> Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are
> available at [1]. There are no slowdowns that I could reproduce
> consistently.
>
> There are two different optimizations that yield this speedup:
> LOAD_METHOD/CALL_METHOD opcodes and a per-opcode cache in the ceval
> loop.
>
>
> LOAD_METHOD & CALL_METHOD
> -------------------------
>
> We had a lot of conversations with Victor about his PEP 509, and he
> sent me a link to his amazing compilation of notes about CPython
> performance [2]. One optimization that he pointed out to me was
> LOAD/CALL_METHOD opcodes, an idea that first originated in PyPy.
>
> There is a patch that implements this optimization; it's tracked here:
> [3]. There are some low level details that I explained in the issue,
> but I'll go over the high level design in this email as well.
>
> Every time you access a method attribute on an object, a BoundMethod
> object is created. It is a fairly expensive operation, despite a
> freelist of BoundMethods (so that memory allocation is generally
> avoided). The idea is to detect what looks like a method call in the
> compiler, and emit a pair of specialized bytecodes for that.
>
> So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have
> LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.
>
> LOAD_METHOD looks at the object on top of the stack, and checks if the
> name resolves to a method or to a regular attribute. If it's a method,
> then we push the unbound method object and the object to the stack.
> If it's an attribute, we push the resolved attribute and NULL.
>
> When CALL_METHOD looks at the stack it knows how to call the unbound
> method properly (pushing the object as a first arg), or how to call a
> regular callable.
>
> This idea makes CPython around 2-4% faster. And it surely doesn't make
> it slower. I think it's a safe bet to at least implement this
> optimization in CPython 3.6.
>
> So far, the patch only optimizes positional-only method calls. It's
> possible to optimize all kinds of calls, but this will necessitate 3
> more opcodes (explained in the issue). We'll need to do some careful
> benchmarking to see if it's really needed.
>
>
> Per-opcode cache in ceval
> -------------------------
>
> While reading PEP 509, I was thinking about how we can use
> dict->ma_version in ceval to speed up globals lookups. One of the key
> assumptions (and this is what makes JITs possible) is that real-life
> programs don't modify globals and rebind builtins (often), and that
> most code paths operate on objects of the same type.
>
> In CPython, all pure Python functions have code objects. When you
> call a function, ceval executes its code object in a frame. Frames
> contain contextual information, including pointers to the globals and
> builtins dict. The key observation here is that almost all code
> objects always have the same pointers to the globals (the module they
> were defined in) and to the builtins. And it's not a good programming
> practice to mutate globals or rebind builtins.
>
> Let's look at this function:
>
>   def spam():
>       print(ham)
>
> Here are its opcodes:
>
>   2       0 LOAD_GLOBAL       0 (print)
>           3 LOAD_GLOBAL       1 (ham)
>           6 CALL_FUNCTION     1 (1 positional, 0 keyword pair)
>           9 POP_TOP
>          10 LOAD_CONST        0 (None)
>          13 RETURN_VALUE
>
> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3. Let's look
> at the first one, that loads the 'print' function from builtins. The
> opcode knows the following bits of information:
>
> - its offset (0),
> - its argument (0 -> 'print'),
> - its type (LOAD_GLOBAL).
>
> And these bits of information will *never* change. So if this opcode
> could resolve the 'print' name (from globals or builtins, likely the
> latter) and save the pointer to it somewhere, along with
> globals->ma_version and builtins->ma_version, it could, on its second
> call, just load this cached info back, check that the globals and
> builtins dicts haven't changed, and push the cached ref to the stack.
> That would save it from doing two dict lookups.
>
> We can also optimize LOAD_METHOD. There is a high chance that 'obj'
> in 'obj.method()' will be of the same type every time we execute the
> code object. So if we'd have an opcode cache, LOAD_METHOD could then
> cache a pointer to the resolved unbound method, a pointer to
> obj.__class__, and the tp_version_tag of obj.__class__. Then it would
> only need to check if the cached object type is the same (and that it
> wasn't modified) and that obj.__dict__ doesn't override 'method'.
> Long story short, this caching really speeds up method calls on types
> implemented in C. list.append becomes very fast, because list doesn't
> have a __dict__, so the check is very cheap (with cache).
Re: [Python-Dev] Speeding up CPython 5-10%
Hi Yury,

(Sorry for misspelling your name previously!)

> Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize
> all kinds of method calls. However, I'm not sure how big the impact
> will be; need to do more benchmarking.

I never did such fine grained analysis with MicroPython. I don't think * and ** are used often enough for it to be worth it, but there are definitely lots of uses of plain keyword args. Also, you'd want to consider how simple/complex it is to treat all these different opcodes in the compiler. For us, it's simpler to treat everything the same. Otherwise your LOAD_METHOD part of the compiler will need to peek deep into the AST to see what kind of call it is.

> BTW, how do you benchmark MicroPython?

Haha, good question! Well, we use Pystone 1.2 (unmodified) to do basic benchmarking, and find it to be quite good. We track our code live at:

http://micropython.org/resources/code-dashboard/

You can see there the red line, which is the Pystone result. There was a big jump around Jan 2015, which is when we introduced opcode dictionary caching. And since then it's been very gradually increasing due to small optimisations here and there.

Pystone is actually a great benchmark for embedded systems because it gives very reliable results there (almost zero variation across runs), and if we can squeeze 5 more Pystones out with some change then we know that it's a good optimisation (for efficiency at least).

For us, low RAM usage and small code size are the most important factors, and we track those meticulously. But in fact, smaller code size quite often correlates with more efficient code, because there's less to execute and it fits in the CPU cache (at least on the desktop).

We do have some other benchmarks, but they are highly specialised for us. For example, how fast can you bit-bang a GPIO pin using pure Python code? Currently we get around 200kHz on a 168MHz MCU, which shows that pure (Micro)Python code is about 100 times slower than C.

> That's a neat idea! You're right, it does require bytecode to become
> writeable. I considered implementing a similar strategy, but this
> would be a big change for CPython. So I decided to minimize the impact
> of the patch and leave the opcodes untouched.

I think you need to consider "big" changes, especially ones like this that can have a great (and good) impact. But really, this is a behind-the-scenes change that *should not* affect end users, and so you should not have any second thoughts about doing it.

One problem I see with CPython is that it exposes way too much to the user (both Python programmer and C extension writer), and this hurts both language evolution (you constantly need to provide backwards compatibility) and the ability to optimise.

Cheers,
Damien.
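For reference, the GPIO benchmark Damien describes boils down to a loop like the following (pyboard-style API assumed; the pin name is board-specific):

  # Pure-Python bit-bang loop of the kind being measured; toggles a
  # pin as fast as the interpreter can go.
  import pyb

  pin = pyb.Pin('X1', pyb.Pin.OUT_PP)
  while True:      # runs forever; measure the toggle frequency externally
      pin.high()
      pin.low()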
Re: [Python-Dev] Speeding up CPython 5-10%
Damien,

On 2016-01-27 4:20 PM, Damien George wrote:
> Hi Yury,
>
> (Sorry for misspelling your name previously!)

NP. As long as the first letter is "y" I don't care ;)

>> Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize
>> all kinds of method calls. However, I'm not sure how big the impact
>> will be; need to do more benchmarking.
>
> I never did such fine grained analysis with MicroPython. I don't think
> * and ** are used often enough for it to be worth it, but there are
> definitely lots of uses of plain keyword args. Also, you'd want to
> consider how simple/complex it is to treat all these different opcodes
> in the compiler. For us, it's simpler to treat everything the same.
> Otherwise your LOAD_METHOD part of the compiler will need to peek deep
> into the AST to see what kind of call it is.
>
>> BTW, how do you benchmark MicroPython?
>
> Haha, good question! Well, we use Pystone 1.2 (unmodified) to do basic
> benchmarking, and find it to be quite good. We track our code live at:
>
> http://micropython.org/resources/code-dashboard/

The dashboard is cool!

An off-topic question: have you ever tried hg.python.org/benchmarks to compare MicroPython vs CPython? I'm curious if MicroPython is faster -- in that case we'll try to copy some optimization ideas.

> You can see there the red line, which is the Pystone result. There was
> a big jump around Jan 2015, which is when we introduced opcode
> dictionary caching. And since then it's been very gradually increasing
> due to small optimisations here and there.

Do you use opcode dictionary caching only for LOAD_GLOBAL-like opcodes? Do you have an equivalent of LOAD_FAST, or do you use dicts to store local variables?

>> That's a neat idea! You're right, it does require bytecode to become
>> writeable. I considered implementing a similar strategy, but this
>> would be a big change for CPython. So I decided to minimize the
>> impact of the patch and leave the opcodes untouched.
>
> I think you need to consider "big" changes, especially ones like this
> that can have a great (and good) impact. But really, this is a
> behind-the-scenes change that *should not* affect end users, and so
> you should not have any second thoughts about doing it.

If we change the opcode size, it will probably affect libraries that compose or modify code objects. Modules like "dis" will also need to be updated. And that's probably just the tip of the iceberg.

We can still implement your approach if we add a separate private 'unsigned char' array to each code object, so that LOAD_GLOBAL can store the key offsets. It should be a bit faster than my current patch, since it has one less level of indirection. But this way we lose the ability to optimize LOAD_METHOD, simply because it requires more memory for its cache. In any case, I'll experiment!

> One problem I see with CPython is that it exposes way too much to the
> user (both Python programmer and C extension writer), and this hurts
> both language evolution (you constantly need to provide backwards
> compatibility) and the ability to optimise.

Right. Even though CPython explicitly states that opcodes and code objects might change in the future, we still have to be careful about changing them.

Yury
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-27 3:10 PM, Damien George wrote:
> Hi Yuri,
>
> I think these are great ideas to speed up CPython. They are probably
> the simplest yet most effective ways to get performance improvements
> in the VM.

Thanks!

> MicroPython has had LOAD_METHOD/CALL_METHOD from the start (inspired
> by PyPy, and the main reason to have it is because you don't need to
> allocate on the heap when doing a simple method call). The specific
> opcodes are:
>
>   LOAD_METHOD         # same behaviour as you propose
>   CALL_METHOD         # for calls with positional and/or keyword args
>   CALL_METHOD_VAR_KW  # for calls with one or both of */**
>
> We also have LOAD_ATTR, CALL_FUNCTION and CALL_FUNCTION_VAR_KW for
> non-method calls.

Yes, we'll need to add CALL_METHOD{_VAR|_KW|etc} opcodes to optimize all kinds of method calls. However, I'm not sure how big the impact will be; need to do more benchmarking.

BTW, how do you benchmark MicroPython?

> MicroPython also has dictionary lookup caching, but it's a bit
> different to your proposal. We do something much simpler: each opcode
> that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
> etc) includes a single byte in the opcode which is an offset-guess
> into the dictionary to find the desired element. Eg for LOAD_GLOBAL
> we have (pseudo code):
>
>   CASE(LOAD_GLOBAL):
>     key = DECODE_KEY;
>     offset_guess = DECODE_BYTE;
>     if (global_dict[offset_guess].key == key) {
>       // found the element straight away
>     } else {
>       // not found, do a full lookup and save the offset
>       offset_guess = dict_lookup(global_dict, key);
>       UPDATE_BYTECODE(offset_guess);
>     }
>     PUSH(global_dict[offset_guess].elem);
>
> We have found that such caching gives a massive performance increase,
> on the order of 20%. The issue (for us) is that it increases bytecode
> size by a considerable amount, requires writeable bytecode, and can be
> non-deterministic in terms of lookup time. Those things are important
> in the embedded world, but not so much on the desktop.

That's a neat idea! You're right, it does require bytecode to become writeable. I considered implementing a similar strategy, but this would be a big change for CPython. So I decided to minimize the impact of the patch and leave the opcodes untouched.

Thanks!
Yury
[Python-Dev] Speeding up CPython 5-10%
Hi,

tl;dr The summary is that I have a patch that improves CPython performance by up to 5-10% on macro benchmarks. Benchmark results on Macbook Pro/Mac OS X, desktop CPU/Linux, server CPU/Linux are available at [1]. There are no slowdowns that I could reproduce consistently.

There are two different optimizations that yield this speedup: LOAD_METHOD/CALL_METHOD opcodes and a per-opcode cache in the ceval loop.


LOAD_METHOD & CALL_METHOD
-------------------------

We had a lot of conversations with Victor about his PEP 509, and he sent me a link to his amazing compilation of notes about CPython performance [2]. One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an idea that first originated in PyPy.

There is a patch that implements this optimization; it's tracked here: [3]. There are some low level details that I explained in the issue, but I'll go over the high level design in this email as well.

Every time you access a method attribute on an object, a BoundMethod object is created. It is a fairly expensive operation, despite a freelist of BoundMethods (so that memory allocation is generally avoided). The idea is to detect what looks like a method call in the compiler, and emit a pair of specialized bytecodes for that.

So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.

LOAD_METHOD looks at the object on top of the stack, and checks if the name resolves to a method or to a regular attribute. If it's a method, then we push the unbound method object and the object to the stack. If it's an attribute, we push the resolved attribute and NULL.

When CALL_METHOD looks at the stack it knows how to call the unbound method properly (pushing the object as a first arg), or how to call a regular callable.

This idea makes CPython around 2-4% faster. And it surely doesn't make it slower. I think it's a safe bet to at least implement this optimization in CPython 3.6.

So far, the patch only optimizes positional-only method calls. It's possible to optimize all kinds of calls, but this will necessitate 3 more opcodes (explained in the issue). We'll need to do some careful benchmarking to see if it's really needed.


Per-opcode cache in ceval
-------------------------

While reading PEP 509, I was thinking about how we can use dict->ma_version in ceval to speed up globals lookups. One of the key assumptions (and this is what makes JITs possible) is that real-life programs don't modify globals and rebind builtins (often), and that most code paths operate on objects of the same type.

In CPython, all pure Python functions have code objects. When you call a function, ceval executes its code object in a frame. Frames contain contextual information, including pointers to the globals and builtins dict. The key observation here is that almost all code objects always have the same pointers to the globals (the module they were defined in) and to the builtins. And it's not a good programming practice to mutate globals or rebind builtins.

Let's look at this function:

  def spam():
      print(ham)

Here are its opcodes:

  2       0 LOAD_GLOBAL       0 (print)
          3 LOAD_GLOBAL       1 (ham)
          6 CALL_FUNCTION     1 (1 positional, 0 keyword pair)
          9 POP_TOP
         10 LOAD_CONST        0 (None)
         13 RETURN_VALUE

The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3. Let's look at the first one, that loads the 'print' function from builtins. The opcode knows the following bits of information:

- its offset (0),
- its argument (0 -> 'print'),
- its type (LOAD_GLOBAL).

And these bits of information will *never* change.
So if this opcode could resolve the 'print' name (from globals or builtins, likely the latter) and save the pointer to it somewhere, along with globals->ma_version and builtins->ma_version, it could, on its second call, just load this cached info back, check that the globals and builtins dicts haven't changed, and push the cached ref to the stack. That would save it from doing two dict lookups.

We can also optimize LOAD_METHOD. There is a high chance that 'obj' in 'obj.method()' will be of the same type every time we execute the code object. So if we'd have an opcode cache, LOAD_METHOD could then cache a pointer to the resolved unbound method, a pointer to obj.__class__, and the tp_version_tag of obj.__class__. Then it would only need to check if the cached object type is the same (and that it wasn't modified) and that obj.__dict__ doesn't override 'method'. Long story short, this caching really speeds up method calls on types implemented in C. list.append becomes very fast, because list doesn't have a __dict__, so the check is very cheap (with cache).

A straightforward way to implement such a cache is simple, but consumes a lot of memory that would be just wasted, since we only need such a cache for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about the cache design.
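That LOAD_GLOBAL fast path can be rendered as a toy Python sketch. VersionedDict below fakes PEP 509's ma_version counter, and cached_load_global is an invented name; the real cache is a C struct keyed by the opcode's offset:

  class VersionedDict(dict):
      """Toy stand-in for a PEP 509 dict with an ma_version counter."""
      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.ma_version = 0
      def __setitem__(self, key, value):
          super().__setitem__(key, value)
          self.ma_version += 1

  def cached_load_global(name, cache, globs, builtins):
      # Fast path: both version tags unchanged -> zero dict lookups.
      if (cache.get('gv') == globs.ma_version
              and cache.get('bv') == builtins.ma_version):
          return cache['obj']
      # Slow path: resolve normally, then (re)fill the cache.
      obj = globs[name] if name in globs else builtins[name]
      cache.update(obj=obj, gv=globs.ma_version, bv=builtins.ma_version)
      return obj

  g, b = VersionedDict(), VersionedDict()
  b['print'] = print
  cache = {}                                   # one cache per call site
  cached_load_global('print', cache, g, b)     # slow path, fills cache
  cached_load_global('print', cache, g, b)     # fast path
  g['print'] = 'shadowed'                      # bumps g.ma_version
  assert cached_load_global('print', cache, g, b) == 'shadowed'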
Re: [Python-Dev] Speeding up CPython 5-10%
On 2016-01-27 3:01 PM, Brett Cannon wrote:
[..]
>> We can also optimize LOAD_METHOD. There is a high chance that 'obj'
>> in 'obj.method()' will be of the same type every time we execute the
>> code object. So if we'd have an opcode cache, LOAD_METHOD could then
>> cache a pointer to the resolved unbound method, a pointer to
>> obj.__class__, and the tp_version_tag of obj.__class__. Then it would
>> only need to check if the cached object type is the same (and that it
>> wasn't modified) and that obj.__dict__ doesn't override 'method'.
>> Long story short, this caching really speeds up method calls on types
>> implemented in C. list.append becomes very fast, because list doesn't
>> have a __dict__, so the check is very cheap (with cache).
>
> What would it take to make this work with Python-defined classes?

It already works for Python-defined classes. But it's a bit more expensive because you still have to check the object's __dict__. Still, there is a very noticeable performance increase (see the results of the benchmark runs).

> I guess that would require knowing the version of the instance's
> __dict__, the instance's __class__ version, the MRO, and where the
> method object was found in the MRO and any intermediary classes to
> know if it was suddenly shadowed? I think that's everything. :)

No, unfortunately we can't use the version of the instance's __dict__, as it is very volatile. The current implementation of the opcode cache works because types are much more stable. Remember, the cache is per *code object*, so it has to work every time the code object is executed.

  class F:
      def spam(self):
          self.ham()  # <- version of self.__dict__ is unstable,
                      #    so we'd end up invalidating the cache
                      #    too often

__class__ version, MRO changes etc. are covered by tp_version_tag, which I use as one of the guards.

> Obviously that's a lot, but I wonder how many classes have a deep
> inheritance model vs. inheriting only from `object`? In that case you
> only have to check self.__dict__.ma_version, self.__class__,
> self.__class__.__dict__.ma_version, and self.__class__.__class__ ==
> `type`. I guess another way to look at this is to get an idea of how
> complex the checks have to get before caching something like this is
> not worth it (probably also depends on how often you mutate
> self.__dict__ thanks to mutating attributes, but you could in that
> instance just decide to always look at self.__dict__ for the method's
> key and then do the ma_version cache check for everything coming from
> the class). Otherwise we can consider looking at the caching
> strategies that Self helped pioneer
> (http://bibliography.selflanguage.org/) that all of the various JS
> engines lifted, and consider caching all method lookups.

Yeah, hidden classes are great. But the infrastructure to support them properly is huge. I think that to make them work you'll need a JIT -- to trace, deoptimize, optimize, and do it all with a reasonable memory footprint. My patch is much smaller and simpler, something we can realistically tune and ship in 3.6.

>> A straightforward way to implement such a cache is simple, but
>> consumes a lot of memory that would be just wasted, since we only
>> need such a cache for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have
>> to be creative about the cache design. Here's what I came up with:
>>
>> 1. We add a few fields to the code object.
>>
>> 2. ceval will count how many times each code object is executed.
>>
>> 3. When the code object is executed over ~900 times, we mark it as
>> "hot".
>
> What happens if you simply consider all code as hot? Is the overhead
> of building the mapping such that you really need this, or is this
> simply to avoid some memory/startup cost?

That's the first step for this patch. I think we need to profile several big applications (I'll do it later for some of my code bases) and see how big the memory impact is if we optimize everything. In any case, I expect it to be noticeable (which may be acceptable), so we'll probably try to optimize it.

>> We also create an 'unsigned char' array "MAPPING", with length set to
>> match the length of the code object. So we have a 1-to-1 mapping
>> between opcodes and the MAPPING array.
>>
>> 4. For the next ~100 calls, while the code object is "hot",
>> LOAD_GLOBAL and LOAD_METHOD do "MAPPING[opcode_offset()]++".
>>
>> 5. After 1024 calls to the code object, the ceval loop will iterate
>> through the MAPPING, counting all opcodes that were executed more
>> than 50 times.
>
> Where did the "50 times" boundary come from? Was this measured somehow
> or did you just guess at a number?

If the number is too low, then you'll optimize code in branches that are rarely executed. So I picked 50, because I only trace opcodes for 100 calls. All of those numbers can be (should be?) changed, and I think we should experiment with different heuristics.
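The warm-up scheme in steps 1-5 can be modelled compactly. The thresholds below are the ones mentioned in the thread; the class and method names are invented for illustration:

  HOT_AFTER   = 900   # calls before a code object is marked "hot"
  TRACE_CALLS = 100   # calls during which opcode runs are counted
  MIN_RUNS    = 50    # opcodes executed more often get a cache entry

  class CodeObjectStats:
      def __init__(self, code_len):
          self.calls = 0
          self.mapping = bytearray(code_len)   # the MAPPING array

      def on_call(self):
          """Return True when it's time to build the actual cache."""
          self.calls += 1
          return self.calls == HOT_AFTER + TRACE_CALLS   # ~1024 calls

      def on_cacheable_opcode(self, offset):
          # Done by LOAD_GLOBAL/LOAD_METHOD while the code is "hot";
          # capped at 255 because MAPPING is an unsigned char array.
          if self.calls >= HOT_AFTER and self.mapping[offset] < 255:
              self.mapping[offset] += 1

      def offsets_to_optimize(self):
          return [off for off, n in enumerate(self.mapping)
                  if n > MIN_RUNS]

  stats = CodeObjectStats(code_len=64)
  for _ in range(HOT_AFTER + TRACE_CALLS):
      ready = stats.on_call()
      stats.on_cacheable_opcode(0)    # pretend LOAD_GLOBAL at offset 0
  assert ready and stats.offsets_to_optimize() == [0]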
Re: [Python-Dev] Speeding up CPython 5-10%
As Brett suggested, I've just run the benchmark suite with memory tracking on. The results are here:

https://gist.github.com/1st1/1851afb2773526fd7c58

Looks like the memory increase is around 1%. One synthetic micro-benchmark, unpack_sequence, which contains hundreds of lines that load a global variable and do nothing else, consumes 5% more.

Yury