Paolo 'Blaisorblade' Giarrusso <p.giarru...@gmail.com> added the comment:

The standing question is still: can we get ICC to produce the expected 
output? It looks like we still haven't managed, and since ICC is the 
best compiler out there, this matters.
SunCC is also problematic: even though it doesn't do jump sharing, it 
seems one doesn't get the speedups there either. I guess that on most 
platforms we should select the most common alternative for interpreters 
(i.e. no switch, one jump table, as given by threadedceval5.patch + 
abstract-switch-reduced.diff); a toy sketch of that dispatch style is 
below.
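
For reference, here is a minimal sketch of that dispatch style (my own 
toy code, not the actual patch): instead of a switch, each handler 
fetches the next opcode and jumps through a single table of label 
addresses, using GCC's labels-as-values extension. The opcodes are made 
up for illustration:

#include <stdio.h>

enum { OP_INCR, OP_DECR, OP_HALT };

static int run_threaded(const unsigned char *code)
{
    /* One jump table, indexed by opcode; each handler dispatches
     * itself instead of jumping back to a central switch. */
    static const void *targets[] = { &&incr, &&decr, &&halt };
    const unsigned char *ip = code;
    int acc = 0;

#define DISPATCH() goto *targets[*ip++]

    DISPATCH();
incr:
    acc++;
    DISPATCH();
decr:
    acc--;
    DISPATCH();
halt:
    return acc;
#undef DISPATCH
}

int main(void)
{
    static const unsigned char prog[] = { OP_INCR, OP_INCR, OP_DECR, OP_HALT };
    printf("%d\n", run_threaded(prog)); /* prints 1 */
    return 0;
}

The point of "one jump table" is that every handler ends in its own 
indirect jump, so the branch predictor can learn per-opcode successor 
patterns instead of funnelling every dispatch through a single shared 
indirect branch.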

On core platforms we can spend time on fine-tuning - and the definition 
of "core platform" boils down to "do developers want to test on it?".

When that's fixed, I think that we just have to choose the simpler form 
and merge that.

@alexandre:
[about removing the switch]
> There is no speed difference on pybench on x86; on x86-64, the code 
> is slower due to the opcode fetching change.

Actually, on my machine the difference seems to come from the changed 
code layout after the switch removal, or something like that, because 
fixing the opcode fetching makes no difference here (see below).
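
To be concrete about what "opcode fetching" means here, this is the 
byte-at-a-time style of ceval.c's NEXTOP()/NEXTARG() macros, shown 
simplified and standalone (the exact code the patches touch may differ):

#include <stdio.h>

/* Simplified sketch of ceval.c-style opcode/oparg fetching: one opcode
 * byte, optionally followed by a two-byte little-endian argument. */
static const unsigned char *next_instr;

#define NEXTOP()  (*next_instr++)
#define NEXTARG() (next_instr += 2, (next_instr[-1] << 8) + next_instr[-2])

int main(void)
{
    /* Hypothetical stream: opcode 100 with argument 0x0102. */
    static const unsigned char code[] = { 100, 0x02, 0x01 };
    next_instr = code;
    int opcode = NEXTOP();
    int oparg  = NEXTARG();
    printf("opcode=%d oparg=0x%04x\n", opcode, oparg);
    return 0;
}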

Indeed, I did my benchmarking duties. The results: 
abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% 
slowdown, while all the others make no significant difference. The 
differences in the assembly output seem to be due to a different code 
layout for some branches; I didn't have a closer look.

However, experimenting with -falign-labels=16 can give a small speedup, 
and I'm trying to improve the results: what I actually want is to align 
just the opcode handlers, which I'll probably do by hand (see the check 
sketched below).
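
Since -falign-labels=16 pads every label in the file, not just the 
handlers, it helps to see what alignment the handlers actually end up 
with. A toy check (mine, not part of any patch) using GCC's &&label 
extension:

#include <stdint.h>
#include <stdio.h>

/* Print the address and 16-byte misalignment of each handler label,
 * e.g. to compare builds with and without -falign-labels=16. */
int main(void)
{
    static const void *handlers[] = { &&h_load, &&h_store, &&h_halt };
    size_t i;
    for (i = 0; i < sizeof handlers / sizeof handlers[0]; i++)
        printf("handler %zu at %p, mod 16 = %d\n",
               i, handlers[i], (int)((uintptr_t)handlers[i] % 16));
    return 0;

h_load:   /* dummy handler bodies, just to give the labels substance */
    puts("load");
h_store:
    puts("store");
h_halt:
    return 0;
}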

reenable-static-prediction can give either a slowdown or a speedup of 
around 1%, i.e. within the statistical noise; a sketch of the 
prediction idiom it restores is below.
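
For readers who haven't looked at ceval.c: the static prediction 
machinery is the PREDICT()/PREDICTED() pair, which lets a handler guess 
the next opcode and, on a hit, fall straight into the predicted handler 
without going through the dispatch table. A simplified, self-contained 
sketch (not the real macros verbatim, which also deal with opargs):

#include <stdio.h>

enum { OP_CMP, OP_JUMP_IF_FALSE, OP_HALT };

#define PREDICT(op)         \
    do {                    \
        if (*ip == op) {    \
            ip++;           \
            goto PRED_##op; \
        }                   \
    } while (0)
#define PREDICTED(op) PRED_##op:

static int run(const unsigned char *ip)
{
    static const void *targets[] = { &&cmp, &&jump_if_false, &&halt };
    int taken = 0;
    goto *targets[*ip++];

cmp:
    /* A comparison is very often followed by a conditional jump. */
    PREDICT(OP_JUMP_IF_FALSE);
    goto *targets[*ip++];

    PREDICTED(OP_JUMP_IF_FALSE);
jump_if_false:
    taken++;                    /* stand-in for the real jump logic */
    goto *targets[*ip++];

halt:
    return taken;
}

int main(void)
{
    static const unsigned char prog[] = { OP_CMP, OP_JUMP_IF_FALSE, OP_HALT };
    printf("%d\n", run(prog)); /* prints 1 */
    return 0;
}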

Note that on my machine I get only a 10% speedup with the base patch, 
and that is more reasonable here. In the original thread on pypy-dev, I 
got a 20% speedup with the Python interpreter I built for my student 
project: since that one is faster* (by a 2-3x factor, like PyVM), 
dispatch accounts for a bigger share of the runtime, so reducing its 
cost has a bigger impact. In fact, I couldn't believe that Python got 
the same speedup.

This is a Core 2 Duo T7200 (Merom) in 64-bit mode with 4 MB of L2 
cache, and since it's a laptop, I expect it to have slower RAM than a 
desktop.

@alexandre:
> The patch make a huge difference on 64-bit Linux. I get a 20% 
> speed-up and the lowest run time so far. That is quite impressive!
Which processor is that?

@pitrou:
> The machine I got the 15% speedup on is in 64-bit mode with gcc 
> 4.3.2.

Which processor is it? I'd guess the biggest speedups should be on the 
Pentium 4, since it has the biggest mispredict penalties.

====
*DISCLAIMER: the interpreter from our group (me and Sigurd Meldgaard) 
is not complete, has some bugs, and its source code has not been 
published yet, so discussion about why it is faster shall not happen 
here - I want to avoid any flames.
I believe it's faster not because of skipped runtime checks or similar, 
but because we used garbage collection instead of refcounting, indirect 
threading, and tagged integers; I don't have time to discuss that yet 
(a generic sketch of what "tagged integers" means is below).
The original thread on pypy-dev has some insights if you are interested 
in this.
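
For readers unfamiliar with the term, a textbook sketch of tagged 
integers (generic illustration only, nothing to do with our unpublished 
implementation):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Object references are pointer-sized words; since real object pointers
 * are at least 2-byte aligned, the low bit can tag small integers
 * stored inline, avoiding a heap allocation per integer. */
typedef uintptr_t value_t;

#define IS_INT(v)    ((v) & 1)
#define MAKE_INT(n)  (((value_t)(intptr_t)(n) << 1) | 1)
#define INT_VALUE(v) ((intptr_t)(v) >> 1)

int main(void)
{
    value_t a = MAKE_INT(21);
    value_t b = MAKE_INT(-2);
    assert(IS_INT(a) && IS_INT(b));
    /* Arithmetic un-tags, operates, and re-tags. */
    value_t sum = MAKE_INT(INT_VALUE(a) + INT_VALUE(b));
    printf("%ld\n", (long)INT_VALUE(sum)); /* prints 19 */
    return 0;
}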

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4753>
_______________________________________