[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Probably the definitions of Py_LIKELY and Py_UNLIKELY should be guarded by #if, rather than testing for __builtin_expect directly, so that new compilers can easily add their own definitions. That was done in the first version, but with the currently supported compilers it is simpler to do it as it is now: both GCC and ICC support the same __builtin_expect syntax, so this way you get less code. Anyway, the code was inspired by the Linux kernel, which only supports those two compilers, so anybody more knowledgeable is welcome to suggest how to express this with other compilers. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4941 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: -fno-gcse is controversial. Even if it avoids jump sharing, the impact of that option has to be measured: global common subexpression elimination allows omitting some recalculations, so disabling it might have a negative impact on other code. It would maybe be better to disable GCSE only for the interpreter loop, but that would make some intra-file inlining impossible. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Yep, agreed. It could be quite cool to rely on __attribute__((cold)) to mark format_exc_check_arg(), but that only works with newer GCCs. I guess I'll add many likely() annotations by hand - in some cases GCC might already get it right, but the heuristics used to choose the likely path statically are quite arguable and subject to change. Also, on other architectures such hints might have a bigger impact on the generated code. I'll give a look at that not earlier than February, or you're welcome to try this. However, at least on x86, code like

if (successCondition) { success(); dispatch(); }
failure();

will probably be compiled to assembly equivalent to adding likely() to the test:

    evaluate successCondition
    if_false goto err
    call success()
    dispatch next instructions
err:
    call failure()

___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4941 ___
[issue4896] Faster why variable manipulation in ceval.c
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Given a 10% speedup on some systems, and statistically insignificant changes on other systems, I would still apply the patch, if only because the bitmask part simply makes more sense. I'm not sure about the goto part, but still, it does straighten the code. Anyway, simply call the label why_is_WHY_NOT, why_set_to_WHY_NOT or something like that. Verbosity on a use-once goto label should be encouraged - we're not Java programmers, but we need to pay for using goto by increasing readability in other ways. @collinwinter: since the differences you report are so low (similar to the statistical noise I get on my machine), I would expect that you're just getting statistical noise rather than different results depending on the GCC version, unless you performed statistical hypothesis testing (confidence intervals and related machinery). And if I had done the tens or hundreds of repetitions needed for hypothesis testing, I'd state that clearly, so I suppose you didn't, and that's fully acceptable since the result is likely to be statistically insignificant anyway. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4896 ___
[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: If the speedups on other machines are confirmed, a slowdown of less than 2% should be acceptable (it's also similar to the statistical noise I have on my machine - it's a pity pybench does not calculate the standard deviation of the execution time). It is true that many objects are short-lived, and that's what makes generational GC work, but the patch is about another point: do most refcount decrements lead to deallocation? That's a completely different question. And I don't think that popping a function call argument usually leads to a destructor invocation, for instance, unless the argument is a temporary. The same reasoning holds for most other opcodes. The simplest way to check this is to look at the compilation output with Profile-Guided Optimization active and verify which branch gets optimized. Having said that, the slowdown you get (on your Athlon X2 3600+) might be due to a higher cost of a failed static prediction, since that leads to 2 jumps. This code leads to 0 or 1 jump: if (likely(cond)) { destructor } while with unlikely(), one runs through either 0 or 2 jumps. The second jump is unconditional, so it has a much smaller penalty (the decode unit just has to tell the fetch unit to fetch from another address). However, in that case one has to invoke the destructor through a virtual call anyway, which is much more expensive. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4941 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: A couple percent maybe is not worth vmgen-ing the interpreter. But even if I'm not a vmgen expert, I've read many papers by Ertl about superinstructions and replication, and the expected speedup from vmgen'ing is much bigger. Is there some more advanced feature we are not using and could use? Have the previous static predictions been converted to superinstructions? Have other common sequences been treated like that? Is there an existing discussion on this? ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4896] Faster why variable manipulation in ceval.c
Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com: -- nosy: +blaisorblade ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4896 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Ok, then vmgen adds little more than direct threading instead of indirect threading. Since the purpose of superinstructions is to eliminate dispatch overhead, and that matters most when little actual work is done per opcode, what about all the opcodes which unconditionally end with FAST_DISPATCH and are common? DUP_TOP, POP_TOP, DUP_TOPX(2,3) and other stack-handling stuff which can't fail? And fusing any of them with some opcode XXX, which poses no error-handling problems? Removing the handling of DUP_TOPX{4,5} is implied - you shouldn't check the functionality of the compiler during interpretation; indeed, even the idea of using a parameter for that is a waste. Having DUP_TOPX2 and DUP_TOPX3, like the JVM, is just simpler. Replication would be trickier since we want the bytecode generation to be deterministic, but it's probably doable too. Bytecode conversion during I/O is perfectly fine, to convert from the actual bytecode to one of the chosen replicas. Conversion in a separate rescan pass can also be OK (less cache-friendly though, so if it's easy to avoid, please do). ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: The standing question is still: can we get ICC to produce the expected output? It looks like we still haven't managed, and since ICC is the best compiler out there, this matters. There are some problems with SunCC too: even though it doesn't do jump sharing, it seems that one doesn't get the speedups. I guess that on most platforms we should select the most common alternative for interpreters (i.e. no switch, one jump table - given by threadedceval5.patch + abstract-switch-reduced.diff). On core platforms we can spend time on fine-tuning - and the definition of "core platform" is given by "do developers want to test for it?". When that's fixed, I think we just have to choose the simpler form and merge that. @alexandre: [about removing the switch] "There is no speed difference on pybench on x86; on x86-64, the code is slower due to the opcode fetching change." Actually, on my machine it looks like the difference is caused by the different code layout resulting from the switch removal, or something like that, because fixing the opcode fetching doesn't make a difference here (see below). Indeed, I did my benchmarking duties. The results are that abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% slowdown, and that all the others don't make a significant difference. The differences in the assembly output seem to be due to a different code layout for some branches; I didn't have a closer look. However, experimenting with -falign-labels=16 can give a small speedup, and I'm trying to improve the results (what I actually want is to align just the opcode handlers, and I'll probably do that by hand). reenable-static-prediction.diff can give either a slowdown or a speedup by around 1%, i.e. around the statistical noise. Note that on my machine I get only a 10% speedup with the base patch, and that is more reasonable here.
In the original thread on pypy-dev, I got a 20% speedup with the Python interpreter I built for my student project, since that one is faster* (by a 2-3x factor, like PyVM), so the dispatch cost is more significant and reducing it has a bigger impact. In fact, I couldn't believe that CPython got the same speedup. This machine is a Core 2 Duo T7200 (Merom) in 64-bit mode with 4MB of L2 cache, and since it's a laptop I expect it to have slower RAM than a desktop. @alexandre: "The patch makes a huge difference on 64-bit Linux. I get a 20% speed-up and the lowest run time so far." That is quite impressive! Which processor is that? @pitrou: "The machine I got the 15% speedup on is in 64-bit mode with gcc 4.3.2." Which processor is that? I guess the bigger speedups should be on the Pentium 4, since it has the bigger misprediction penalties. *DISCLAIMER: the interpreter of our group (me and Sigurd Meldgaard) is not complete, has some bugs, and the source code has not yet been published, so discussion about why it is faster shall not happen here - I want to avoid any flame. I believe it's not because of skipped runtime checks or such stuff, but because we used garbage collection instead of refcounting, indirect threading, and tagged integers, but I don't have time to discuss that yet. The original thread on pypy-dev has some insights if you are interested. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: "Same for CPU-specific tuning: I don't think we want to ship Python with compiler flags which depend on the particular CPU being used." I wasn't suggesting that - but since different CPUs have different optimization rules, something like "oh, 20% performance slowdown on PowerPC" or "on P4" is important to know (and yeah, configure options are a good solution). What is the barrier for platform-specific tricks, as long as the code is still portable? I'd like to experiment with __builtin_expect and with manual alignment (through 'asm volatile(".p2align 4")' on x86/x86_64 with GAS - PPC would probably need a different alignment), all hidden behind macros so it disappears on unsupported platforms, without any configure option for them (there shouldn't be the need for one). "I doubt many people compile Python with icc, honestly." Yep :-(. <rant>Why don't distributors do it?</rant> (The first culprit might be license/compatibility problems I guess, but the speedup would be worth the time to fix the troubles IMHO). ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: @ajaksu2: "Applying your patches makes no difference with gcc 4.2 and gives a barely noticeable (~2%) slowdown with icc." "Your patches" is somewhat unclear :-) Which patch sets are you comparing? And on 32 or 64 bits? But does Yonah support 64 bits? IIRC no, but I'm not sure. I would be surprised by slowdowns from restore-old-oparg-load.diff, really surprised, and I would be just surprised by slowdowns from reenable-static-prediction.diff. Also, about the ICC output, we still need to ensure that it's properly compiled (see above the instructions for counting "jmp *" or similar); in the measurements above, ICC did miscompile the patch with the switch. By "properly compiled" I mean that separate indirect branches are generated, instead of just one. "These results are from a Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to run benchmarks on." Not really - at the very least we should listen to results on Pentium 4, Core (i.e. Yonah) and Core 2, and I would also add Pentium 3/Pentium M to represent the P6 family. Anyway, I have to do my own benchmarks on this; I hope this weekend I'll have time. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com: Added file: http://bugs.python.org/file12634/restore-old-oparg-load.diff ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com: Added file: http://bugs.python.org/file12633/abstract-switch-reduced.diff ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: I finally implemented my suggestion for the switch elimination. On top of threadedceval5.patch, apply abstract-switch-reduced.diff and then restore-old-oparg-load.diff to test it. This way, only computed gotos are used. I would like whoever had miscompilation problems, or didn't get an advantage from the patch, to try compiling and benchmarking this version. I've also been able to re-enable static prediction (PREDICT_*) on top of computed gotos, and that may help CPU prediction even more (the BTB entry for the computed goto will be used to predict the 2nd most frequent target); obviously it may instead cause a slowdown. I'd need stats on opcode frequency to guess in advance (I'll try gathering them later through DYNAMIC_EXECUTION_PROFILE). Apply reenable-static-prediction.diff on top of the rest to get this. I'll have to finish other stuff before closing everything to run pybench - I can't get stable timings otherwise, so it'll take some time (too busy, sorry). However, I ran the check for regressions and it shows none. abstract-switch-reduced.diff is the fixed abstract-switch.diff - actually there was just one hunk which changed the handling of f_lasti, and that looked extraneous. See the end of the message:

--- a/Python/ceval.c	Thu Jan 01 23:54:01 2009 +0100
+++ b/Python/ceval.c	Sun Jan 04 14:21:16 2009 -0500
@@ -1063,12 +1072,12 @@
 	}

 fast_next_opcode:
-	f->f_lasti = INSTR_OFFSET();

 	/* line-by-line tracing support */

 	if (_Py_TracingPossible &&
 	    tstate->c_tracefunc != NULL && !tstate->tracing) {
+		f->f_lasti = INSTR_OFFSET();
 		/* see maybe_call_line_trace
 		   for expository comments */
 		f->f_stacktop = stack_pointer;

Added file: http://bugs.python.org/file12635/reenable-static-prediction.diff ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: @skip: In simple words, the x86 call

    call 0x2000

placed at address 0x1000 becomes:

    call %rip + 0x1000

RIP holds the instruction pointer, which will be 0x1000 in this case (actually, I'm ignoring the detail that while executing the call, RIP already points to the first byte of the next instruction). If I execute the same instruction from a different location (i.e. a different RIP), things will break. So, only code for opcodes without real calls and without accesses to globals can be copied like this (inlined calls are OK). With refcounting, not even POP_TOP is safe, since it can call destructors. DUP_TOP is still safe, I guess. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: @pitrou: <ranting mode>Argh, reference counting hinders even that?</ranting mode> I just discovered another problem caused by refcounting. Various techniques allow creating binary code from the interpreter binary, by just pasting together the code for the common interpreter cases and producing calls to the others. But, guess what, on most platforms (except plain x86, but including x86-64, and maybe even x86 in the shared-library case) this does not work if the copied code includes function calls (on x86-64 that's due to RIP-relative addressing, and on other platforms due to similar issues). So Python could not even benefit from that! That's a real pity... I'll have to try subroutine threading, to see whether it's faster than indirect threading on current platforms or still slower. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: @Alexandre: [I asked] "So, can you try dropping the switch altogether, using always computed goto and seeing how the resulting code gets compiled?" "Removing the switch won't be possible unless we change the semantics of EXTENDED_ARG. In addition, I doubt the improvement, if any, would be worth the increased complexity." OK, it's time that I post code to experiment with that - there is no need to break EXTENDED_ARG. And the point is to fight miscompilations. "Do you actually mean the time spent interpreting bytecodes compared to the time spent in the other parts of Python? If so, your figures are wrong for CPython on x86-64. It is about 50% just like on x86 (when running pybench). With the patch, this drops to 35% on x86-64 and to 45% on x86." More or less, I meant that, but I was making an example with made-up, if reasonable, figures. 70%, or even more, just for _dispatch_ (i.e. just for the mispredicted indirect jump) is valid for real-world Smalltalk interpreters, for instance, or for the ones in "The Structure and Performance of Efficient Interpreters". But when you say "interpreting bytecodes", I do not know which part you refer to - just the computed goto, or the whole code in the interpreter function? ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: @Skip: if one decides to generate binary code, there is no need to use switches. Inline threading (also known as "code copying" in some research papers) is probably what you are looking for: http://blog.mozilla.com/dmandelin/2008/08/27/inline-threading-tracemonkey-etc/ For references and background on the threading techniques mentioned there, see: http://en.wikipedia.org/wiki/Threaded_code http://www.complang.tuwien.ac.at/forth/threaded-code.html ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: @alexandre: if you add two labels per opcode and two dispatch tables, one before the parameter fetch (like now) and one after it (where we have the 'case'), you can keep the same speed. Under the hood we also had two dispatch tables before, with the switch, so it's not a big deal; and the second table is only used in the slow path (i.e. EXTENDED_ARG, or when additional checks are needed). About f->f_lasti: when I have time I want to try optimizing it - somewhere you can be sure it's not going to be used. But you have some changes touching that in the abstract-switch patch; I think that was not intended? ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: 1st note: is that code from the threaded version? Note that you need to modify the source to make it accept ICC as well to try that. In case you already did that, I guess the patch is not useful at all with ICC since, as far as I can see, the jump is shared. It is vital to this patch that the jump is not shared; something similar to -fno-crossjumping should be found for ICC. 2nd note: the answer to your question seems to be yes, ICC has fewer register spills. Look for instance at:

    movl	-272(%ebp), %ecx
    movzbl	(%ecx), %eax
    addl	$1, %ecx

versus:

    movzbl	(%esi), %ecx
    incl	%esi

This represents the increment of the program counter after loading the next opcode. In the code you posted, one can see that the program counter is spilled to memory by GCC but isn't by ICC. Either the spill is elsewhere, or ICC is better here. And it's widely known that ICC has a much better optimizer in many cases, and I remember that GCC's register allocator really needs improvement. Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole optimizer should remove that; I'm not shocked only because I've never seen perfect GCC output. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Daniel, I forgot to ask for the compilation command lines you used, since they make a lot of difference. Can you post them? Thanks ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: About miscompilations: the current patch is a bit weird for GCC, because it keeps both the switch and the computed goto. But actually, there is no case in which the switch is needed, and computed gotos leave less room for GCC's choices. So, can you try dropping the switch altogether, always using computed goto, and seeing how the resulting code gets compiled? I see you'll need two labels (before and after argument fetch) per opcode and two dispatch tables, but that's no big deal (except for code alignment - align just the common branch target). An important warning: by default, on my system, GCC 4.2 aligns branch targets for a switch to a 16-byte boundary (as recommended by the Intel optimization guide), by adding a ".p2align 4,,7" GAS directive, but it does not do that for computed gotos. Adding the directive by hand gave a small speedup, 2% I think; I should try -falign-jumps=16 if it's not already enabled (some -falign-jumps is enabled by -O2), since that is supposed to give the same result. Please use that yourself as well, and verify it works for labels, even if I fear it doesn't. "However, I don't know why the speed up due to the patch is much more significant on x86-64 than on x86." It's Amdahl's law, even if this is not about parallel code: when the rest is faster (x86-64), the same speedup on dispatch gives a bigger overall speedup. To be absolutely clear: x86-64 has more registers, so the rest of the interpreter is faster than on x86, but dispatch still takes the same absolute time - say 70% of the runtime on x86-64 but only 50% on x86 (realistic figures); if this patch halved dispatch time on both (we're not so lucky), we would save 35% on x86-64 but only 25% on x86. In fact, on inefficient interpreters, indirect threading is useless altogether. So, do those extra registers help _so_ much? Yes.
In my toy interpreter, computing last_i for each dispatch doesn't give any big slowdown, but storing it in f->f_lasti gives a ~20% slowdown - I cross-checked multiple times because I was astonished. And when the program counter itself had to be stored in memory, I think it was something like 2x slower. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: "I'm not an expert in this kind of optimizations. Could we gain more speed by making the dispatcher table more dense? Python has less than 128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a smaller table. I naively assume a smaller table increases the amount of cache hits." Well, you have no binary compatibility constraint with a new release, so it can be tried and benchmarked - or it can be done anyway! On x86-64 the footprint of the jump table is 8 bytes per pointer * 256 pointers = 2KiB, while the L1 data cache of a Pentium 4 is 8KiB or 16KiB wide. But I don't expect this to be noticeable in most synthetic microbenchmarks. Matrix multiplication would be the perfect one, I guess: the repeated column accesses would kill the L1 data cache, if the matrices don't fit in it as a whole. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: "We would have to change opcode.h for this to be truly useful (in order to re-use OPCODE_LIST())." Yep. "I think that should be the subject of a separate bug entry for code reorganization." Agreed, I'll maybe try to find time for it. "Thanks for all the explanation and pointers!" You're welcome, and thanks to you for writing the patch! "About register allocation, I wonder if the temporary variables u,v,w could be declared separately in each opcode rather than at the top of the eval function, so that the compiler doesn't try to store their values between two opcodes." I didn't look at the output, but that shouldn't be needed with decent optimizers, since they are local variables and the compiler has a full picture of their usage (this does not happen with the contents of the heap, where the frame object may lie). I think that change would just save some compilation time for dataflow analysis, maybe :-). Or it could make clearer which variable is used where, but that is a matter of taste (what's there is fine for me). I just studied liveness analysis in compilers: it computes whether a variable is live before and after each statement; if the value of a variable is not used in some piece of code until the next write to it, the variable is considered dead in that piece of code and does not take space. Since u, v, w are either unused or written before use in all opcodes, they are dead at the beginning of each opcode, so they are also dead just before dispatch. The only important thing is that the contents of the jump table are known to the compiler and that the compiler makes use of that knowledge. Simply passing a non-const jump table to some function defined elsewhere (which could in theory modify it) would make the output much worse.
___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: "I attached some additional benchmarks on SunOS. So far, it seems the benefits of the proposed optimization are highly compiler-dependent." It would be more correct to say that, as you verified for GCC 3.4, miscompilation of the code happens easily. Any literature search shows that threading helps in a fast interpreter. My experience shows two exceptions to this rule: a) bad compiler output; b) interpreters which are not efficient enough - when other operations are even slower than instruction dispatch (which is really slow due to costly mispredictions), threading can't help. This is shown by the number of interpreters using threading. Wikipedia has more pointers on this: http://en.wikipedia.org/wiki/Threaded_code Note that what I called "indirect threading" is called "token threading" there. Another example of the importance of threading is shown in this article: http://webkit.org/blog/189/announcing-squirrelfish/ Some clues about why Python does not use threading: http://mail.python.org/pipermail/python-list/1999-May/002003.html It is important to note that the people in that mail were not aware of why threading gives a speedup. == For SunCC, I can't say anything without looking at: a) the generated code - if jump targets were aligned only for the switch but not for the computed gotos, for instance, that could maybe explain such a slowdown; lots of other details might be relevant; b) performance counter results, especially regarding mispredictions of indirect branches. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4753 ___
[issue4753] Faster opcode dispatch on gcc
Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com: -- nosy: +blaisorblade
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Mentioning other versions as well. The patch is so easy that it can be backported to all supported versions, so I'm adding all of them (2.5 is in bugfix-only mode, and as far as I can see this patch cannot be accepted there, sadly). -- versions: +Python 2.6, Python 2.7, Python 3.0
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Some other comments. The time savings of indirect threading are also associated with the removal of the range check, but better branch prediction is the main advantage.

> Also, the macro USE_THREADED_CODE should be renamed to something else; the word thread is too tightly associated with multi-threading.

That's true. Furthermore, "threaded code" simply refers to code consisting only of function calls.

> Maybe USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be better.

I'd prefer USE_DIRECT_DISPATCH (or better, USE_THREADED_DISPATCH) over USE_COMPUTED_GOTO, since the latter names just the language construct used. "Indirect threading" is the standard name in the CS literature for this technique. Direct threading is a variation where, in the bytecode array, each opcode is replaced by the pointer opcode_handler[opcode], so that dispatch is slightly faster. Most interpreters use indirect threading to save memory, and because it makes it possible to switch the opcode handler table, for instance to activate debugging.

The best paper about this is: The Structure and Performance of Efficient Interpreters, M. Anton Ertl and David Gregg, 2003. The original paper about (direct) threaded code is: Threaded Code, James R. Bell, Comm. of the ACM, 1973, but I don't feel it is relevant at all today. Indirect threading was introduced in: Indirect Threaded Code, Robert B. K. Dewar, Communications of the ACM, 1975 (that's just a bit more relevant, but still).
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

> You may want to check out issue1408710 in which a similar patch was provided, but failed to deliver the desired results.

It's not really similar, because that patch does not duplicate the dispatch code. It took me some time to understand why it didn't change the goto fast_next_opcode, but that's where it misses the speedup. The only difference from the switch is that it saves the range check, so the slowdown probably comes from some minor change in GCC's output.

Anyway, this suggests that the speedup really comes from better branch prediction and not from saving the range check. The first paper I mentioned states that saving the range check might make only a small difference. The point is that when you are going to flush the pipeline anyway, adding a few instructions, even conditional jumps, sometimes does not make a difference. I've observed this behaviour quite a few times while building a small Python interpreter from scratch. I guess (but this might be wrong) that's because the execution units were not used to their fullest, and adding conditional jumps doesn't make a difference because flushing a pipeline once or twice is almost the same (the second flush removes just a few instructions). Or something like that - I'm not expert enough in CPU architecture to be sure of such guesses.
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: Topics: 1) different speedups on 32 bits vs 64 bits; 2) the PPC slowdown; 3) PyPI.

=== About different speedups on 32 bits vs 64 bits ===
An interpreter is very register-hungry, so on x86_64 it spends much less time on register spills (i.e. moving locals from/to memory), and instruction dispatch therefore takes a bigger share of execution time. If the rest of the interpreter is too slow, indirect threading gives no advantage. Look at the number of register variables in PyEval_EvalFrameEx() (by the way, do you support any compiler where 'register' still makes a difference nowadays? That would be quite weird). Many of them are simply cached copies of fields of the current frame and of the current function; without copying them to locals, the compiler would have to assume they could change at any function call. In fact, adding locals this way gave huge speedups on tight loops in the Python interpreter I built with a colleague for our student project on speeding up Python. And adding a write to memory in the dispatch code (to f->f_lasti) gave a 20% slowdown. Since my interpreter uses a copying garbage collector while CPython uses reference counting, which is much slower (if you don't believe this, show me a fast JVM with reference counting), I'm even surprised you can get such a big speedup from threading.

=== About the PPC slowdown ===
Somebody should try the patch on a Pentium4 as well. During our VM course, threading slowed down a toy interpreter with 4 toy opcodes. Our teachers suggested that with such a small interpreter, since threaded code takes more space (in that case, ~64 vs ~100 bytes), this could cause problems with the code cache, but suggested verifying that idea with performance counters; I'm not sure why. Right now I have neither a Pentium4 nor a PowerPC available, so I can't check myself, but that is the best way to analyze such unexpected performance behaviour.

=== PyPI ===
Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch cannot be accepted there, sadly).
Skip> You could backport it to 2.4, 2.5 and just put it up on PyPI...
I was thinking of a private backport as well. I didn't know about PyPI; it looks like PyPI is more for contributed modules than for this - would that work?
[issue4753] Faster opcode dispatch on gcc
Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment: == On the patch itself == Why don't you use the C preprocessor instead of that Python code? Sample code:

#define OPCODE_LIST(DEFINE_OPCODE) \
    DEFINE_OPCODE(STOP_CODE, 0)    \
    DEFINE_OPCODE(POP_TOP, 1)      \
    DEFINE_OPCODE(ROT_TWO, 2)      \
    DEFINE_OPCODE(ROT_THREE, 3)    \
    DEFINE_OPCODE(DUP_TOP, 4)      \
    DEFINE_OPCODE(ROT_FOUR, 5)     \
    DEFINE_OPCODE(NOP, 9)          \
    ...

#define DECL_OPCODE(opcode, num) \
    [num] = &&label_ ## opcode,
void *opcodes[] = {
    OPCODE_LIST(DECL_OPCODE)
};
#undef DECL_OPCODE

There are also other ways to do it, but using higher-order functions within the preprocessor this way is something I learned from the V8 source code. It has the advantage that OPCODE_LIST can be used in a lot of other places (for instance when implementing the 'opcode' module, if it were written in C).