[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor

2009-01-26 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

> Probably #if the definitions of Py_LIKELY and Py_UNLIKELY instead of
> __builtin_expect so new compilers can easily add their own definitions.

This was done in the first version, but with the currently supported
compilers it's simpler to do it like this, because both GCC and ICC
support the same __builtin_expect syntax, so you get less code this way.

Anyway, the code was inspired by the Linux kernel, which only supports
those two compilers, so anybody more knowledgeable is welcome to suggest
how to express this with other compilers. A minimal sketch of what I
mean is below.
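For reference, a minimal sketch of such #if'd definitions (the compiler
checks and the fallback are indicative, not the actual patch):

    #if defined(__GNUC__) || defined(__INTEL_COMPILER)
    /* GCC and ICC share the __builtin_expect syntax. */
    #  define Py_LIKELY(x)    __builtin_expect(!!(x), 1)
    #  define Py_UNLIKELY(x)  __builtin_expect(!!(x), 0)
    #else
    /* Unknown compilers: the hints become no-ops. */
    #  define Py_LIKELY(x)    (x)
    #  define Py_UNLIKELY(x)  (x)
    #endif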


[issue4753] Faster opcode dispatch on gcc

2009-01-26 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

-fno-gcse is controversial.
Even if it might avoid jump sharing, the impact of that option has to
be measured: common subexpression elimination allows omitting some
recalculations, so disabling global CSE might have a negative impact on
other code.

It would maybe be better to disable GCSE only for the interpreter loop,
but that would make some intra-file inlining impossible. A sketch of
the idea is below.
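If I remember correctly, newer GCCs are also gaining per-function
optimization control (an optimize attribute, in GCC 4.4 if I'm not
mistaken - treat the exact spelling and version as assumptions to
verify). With that, something like this would express the idea:

    /* Hypothetical: disable GCSE only for the dispatch-heavy function,
       leaving it enabled for the rest of the file (needs a GCC with
       the optimize attribute). */
    __attribute__((optimize("no-gcse")))
    PyObject *
    PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
    {
        /* ... the interpreter loop ... */
    }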


[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor

2009-01-16 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

Yep, agreed. It could be quite cool to rely on __attribute__((cold)) to
mark format_exc_check_arg(), but that only works with newer GCCs. I
guess I'll add many likely() by hand - in some cases GCC might already
get it right, but the heuristics it uses to statically choose the likely
path are quite arguable and subject to change. Also, on other archs such
hints might have a bigger impact on the generated code.

I'll take a look at that no earlier than February, or you're welcome to
try it. However, at least on x86, an if (successCondition) {
success(); dispatch(); } failure(); will probably be compiled to
assembly like the following, equivalent to adding likely() to the test:

    evaluate successCondition
    if_false goto err
    call success()
    dispatch next instruction

err:
    call failure()


[issue4896] Faster why variable manipulation in ceval.c

2009-01-16 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

Given a 10% speedup on some systems and statistically insignificant
changes on the others, I would still apply the patch, if only because
the bitmask part simply makes more sense.
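To make the bitmask point concrete: the WHY_* codes in ceval.c are
powers of two, so several of them can be tested with a single mask
(a sketch; the exact tests touched by the patch may differ):

    enum why_code {
        WHY_NOT       = 0x0001, /* No error */
        WHY_EXCEPTION = 0x0002, /* Exception occurred */
        WHY_RERAISE   = 0x0004,
        WHY_RETURN    = 0x0008,
        WHY_BREAK     = 0x0010,
        WHY_CONTINUE  = 0x0020,
        WHY_YIELD     = 0x0040
    };

    /* One masked test replaces a chain of equality comparisons: */
    if (why & (WHY_BREAK | WHY_CONTINUE))
        ...;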

I'm not sure about the goto part, but it does straighten the code.
Anyway, simply call the label why_is_WHY_NOT, why_set_to_WHY_NOT or
something like that. Verbosity on a use-once goto label should be
encouraged - we're not Java programmers, but we need to pay for using
goto by increasing readability in other ways.

@collinwinter: since the differences you report are so low (similar to
the statistical noise I get on my machine), I would expect that you're
just seeing statistical noise rather than genuinely different results
depending on the GCC version, unless you performed statistical
hypothesis testing (confidence intervals and related stuff). And if you
had done the needed tens/hundreds of repetitions for hypothesis testing,
you'd have stated that clearly, so I suppose you didn't - which is fully
acceptable, since the result is likely to be statistically insignificant
anyway.


[issue4941] Tell GCC Py_DECREF is unlikely to call the destructor

2009-01-13 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

If the speedups on other machines are confirmed, a slowdown of less than
2% should be acceptable (it's also similar to the statistical noise I
have on my machine - it's a pity pybench does not calculate the standard
deviation of the execution times).

It is true that many objects are short-lived, and that is what makes
generational GC work, but the patch is about a different point: do most
refcount decrements lead to a deallocation? That's a completely
different question. I don't think that popping a function call
argument usually leads to a destructor invocation, for instance, unless
the argument is a temporary. The same reasoning holds for most other
opcodes.

The simplest way to check this is to look at the compilation output with
Profile-Guided Optimization active and verify which branch is optimized.

Having said that, the slowdown you get (on your Athlon X2 3600+) might
be due to a higher cost in case of failed static prediction, since that
leads to 2 jumps.

This code leads to 0 or 1 jump:

if (likely(cond)) {
  destructor
}

while with unlikely(), one runs through either 0 or 2 jumps. The second
jump is unconditional, so it has a much smaller delay (the decode unit
just has to tell the fetch unit to fetch from another address). However,
in that case one has to invoke the destructor anyway, through a virtual
call, which is much more expensive.
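In other words, the proposal amounts to something like this sketch
(simplified; the real Py_DECREF also does debug-build bookkeeping):

    #define likely(x) __builtin_expect(!!(x), 1)

    #define Py_DECREF(op)                                       \
        do {                                                    \
            if (likely(--((PyObject*)(op))->ob_refcnt != 0))    \
                ;   /* common case: the object stays alive */   \
            else                                                \
                _Py_Dealloc((PyObject *)(op));  /* cold path */ \
        } while (0)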


[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

A couple percent is maybe not worth vmgen-ing. But even if I'm not a
vmgen expert, I read many papers by Ertl about superinstructions and
replication, and the speedup to be expected from vmgen'ing is much
bigger.
Is there some more advanced feature we are not using and could use?
Have the previous static predictions been converted to superinstructions?
Have other common sequences been treated like that?
Is there an existing discussion on this?


[issue4896] Faster why variable manipulation in ceval.c

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso

Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com:


--
nosy: +blaisorblade


[issue4753] Faster opcode dispatch on gcc

2009-01-12 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

OK, so vmgen adds little more than direct threading, instead of indirect
threading.

Since the purpose of superinstructions is to eliminate dispatch
overhead, and that matters most when little actual work is done per
opcode, what about all the opcodes which unconditionally end with
FAST_DISPATCH and are common - DUP_TOP, POP_TOP, DUP_TOPX(2,3) and other
stack-handling stuff which can't fail? And what about fusing any of them
with an arbitrary opcode XXX, since no error-handling problems arise
(a sketch follows)? Removing the handling of DUP_TOPX{4,5} is implied:
you shouldn't check the functionality of the compiler during
interpretation - indeed, even the idea of using a parameter for that is
a waste. Having DUP_TOPX2 and DUP_TOPX3, like the JVM, is just simpler.
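As a hedged illustration (the fused opcode and its name are made up;
the macros are the ones used in ceval.c), a superinstruction for the
failure-free pair DUP_TOP; STORE_FAST could look like:

    case DUP_TOP_STORE_FAST:    /* hypothetical fused opcode */
        v = TOP();              /* the value DUP_TOP would copy...  */
        Py_INCREF(v);
        SETLOCAL(oparg, v);     /* ...goes straight into the local  */
        FAST_DISPATCH();        /* no error path in either half     */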

> Replication would be trickier since we want the bytecode generation to
> be deterministic, but it's probably doable too.

Bytecode conversion during I/O is perfectly fine for converting from the
actual bytecode to one of the chosen replicas. Conversion in a rescan
pass can also be OK (less cache-friendly though, so if it's easy to
avoid, please do).


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

The standing question is still: can we get ICC to produce the expected
output? It looks like we still haven't managed, and since ICC is the
best compiler out there, this matters.
There are also some problems with SunCC: even though it doesn't do jump
sharing, it seems one doesn't get the speedups - I guess that on most
platforms we should select the most common alternative for interpreters
(i.e. no switch, one jump table, given by threadedceval5.patch +
abstract-switch-reduced.diff).

On core platforms we can spend time on fine-tuning - and the definition
of "core platforms" is given by "do developers want to test for that?".

Once that's fixed, I think we just have to choose the simpler form and
merge it.

@alexandre [about removing the switch]:
> There is no speed difference on pybench on x86; on x86-64, the code
> is slower due to the opcode fetching change.

Actually, on my machine it looks like the difference is caused by the
different layout resulting from the switch removal, or something like
that, because fixing the opcode fetching doesn't make a difference here
(see below).

Indeed, I did my benchmarking duties. The results are that
abstract-switch-reduced.diff (the one removing the switch) gives a 1-3%
slowdown, and that all the others don't make a significant difference.
The differences in the assembly output seem to be due to a different
code layout for some branches; I didn't have a closer look.

However, experimenting with -falign-labels=16 can give a small speedup,
and I'm trying to improve the results (what I actually want is to align
just the opcode handlers, which I'll probably do by hand).

reenable-static-prediction.diff can give either a slowdown or a speedup
of around 1%, i.e. within the statistical noise.

Note that on my machine I get only a 10% speedup with the base patch,
which is more reasonable here. In the original thread on pypy-dev, I
got a 20% speedup with the Python interpreter I built for my student
project, since that one is faster* (by a 2-3x factor, like PyVM), so
the dispatch cost is more significant, and reducing it has a bigger
impact. In fact, I couldn't believe that Python got the same speedup.

This is a Core 2 Duo T7200 (Merom) in 64bit mode with 4MB of L2 cache, 
and since it's a laptop I expect it to have slower RAM than a desktop.

@alexandre:
> The patch make a huge difference on 64-bit Linux. I get a 20%
> speed-up and the lowest run time so far. That is quite impressive!

Which processor is that?

@pitrou:
> The machine I got the 15% speedup on is in 64-bit mode with gcc
> 4.3.2.

Which processor is it? I guess the biggest speedups should be on
Pentium4, since it has the biggest mispredict penalties.


*DISCLAIMER: the interpreter of our group (mine and Sigurd Meldgaard's)
is not complete, has some bugs, and its source code has not yet been
published, so discussion about why it is faster shall not happen here -
I want to avoid any flames.
I believe it's not because of skipped runtime checks or such stuff, but
because we used garbage collection instead of refcounting, indirect
threading, and tagged integers; I don't have time to discuss that yet,
though.
The original thread on pypy-dev has some insights if you are interested
in this.


[issue4753] Faster opcode dispatch on gcc

2009-01-10 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

> Same for CPU-specific tuning: I don't think we want to ship Python
> with compiler flags which depend on the particular CPU being used.

I wasn't suggesting this - but since different CPUs have different
optimization rules, something like "oh, 20% performance slowdown on
PowerPC or on P4" is important to know (and yeah, configure options
are a good solution).

What is the barrier for platform-specific tricks, as long as the code
stays portable? I'd like to experiment with __builtin_expect and with
manual alignment (through asm volatile(".p2align 4") on x86/x86_64
with GAS - PPC probably needs a different alignment).

All of it hidden behind macros that disappear on unsupported platforms,
without any configure option for them (there shouldn't be the need for
that); see the sketch below.
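A minimal sketch of the macro I have in mind (the name is made up; the
directive is the GAS one mentioned above):

    #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* Align the next emitted instruction (e.g. an opcode handler)
       to a 16-byte boundary. */
    #  define ALIGN_DISPATCH_TARGET()  __asm__ volatile (".p2align 4")
    #else
    #  define ALIGN_DISPATCH_TARGET()  /* no-op elsewhere */
    #endif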

> I doubt many people compile Python with icc, honestly.

Yep :-(. <rant>Why don't distributors do it?</rant> (The first culprit
might be license/compatibility problems, I guess, but the speedup would
be worth the time to fix the troubles IMHO.)


[issue4753] Faster opcode dispatch on gcc

2009-01-09 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

@ajaksu2:
> Applying your patches makes no difference with gcc 4.2 and gives a
> barely noticeable (~2%) slowdown with icc.

"Your patches" is quite unclear :-) Which patch sets are you comparing?
And on 32 or 64 bits? Does Yonah support 64 bits, by the way? IIRC no,
but I'm not sure.
I would be surprised by slowdowns with restore-old-oparg-load.diff,
really surprised. And I would be just surprised by slowdowns with
reenable-static-prediction.diff.
Also, about the ICC output, we still need to ensure that it's properly
compiled (see above for the instructions on counting "jmp *" or
similar). In the measurements above, ICC did miscompile the patch with
the switch. By "properly compiled" I mean that separate indirect
branches are generated, instead of just one.

> These results are from a
> Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to
> run benchmarks on.

Not really - at the very least we should listen to results on Pentium 4,
Core (i.e. Yonah) and Core 2, and I would also add Pentium 3/Pentium M
to represent the P6 family.
Anyway, I still have to do my own benchmarks on this; I hope I'll have
time this weekend.


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com:


Added file: http://bugs.python.org/file12634/restore-old-oparg-load.diff


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com:


Added file: http://bugs.python.org/file12633/abstract-switch-reduced.diff


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

I finally implemented my suggestion for the switch elimination.
On top of threadedceval5.patch, apply abstract-switch-reduced.diff and
then restore-old-oparg-load.diff to test it.

This way, only computed gotos are used. I would like those who had
miscompilation problems, or who didn't get an advantage from the patch,
to try compiling and benchmarking this version.

I've also been able to re-enable static prediction (PREDICT_*, sketched
below) on top of computed gotos, and that may help CPU prediction even
more (the BTB for the computed goto will be used to predict the 2nd most
frequent target); obviously it may instead cause a slowdown. I'd need
stats on opcode frequency to try guessing in advance (I'll try gathering
them later through DYNAMIC_EXECUTION_PROFILE).
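For reference, the static prediction macros in ceval.c look roughly
like this (simplified; they are compiled out when
DYNAMIC_EXECUTION_PROFILE is defined):

    #define PREDICT(op)             if (*next_instr == op) goto PRED_##op
    #define PREDICTED(op)           PRED_##op: next_instr++
    #define PREDICTED_WITH_ARG(op)  PRED_##op: oparg = PEEKARG(); \
                                    next_instr += 3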

Apply reenable-static-prediction.diff on top of the rest to get this.

I'll have to finish other stuff before closing everything down to run
pybench - I can't get stable timings otherwise - so it'll take some time
(too busy, sorry). However, I ran the check for regressions and it shows
none.


abstract-switch-reduced.diff is the fixed abstract-switch.diff -
actually, there was just one hunk which changed the handling of
f->f_lasti, and that looked extraneous. See the end of the message.

--- a/Python/ceval.c    Thu Jan 01 23:54:01 2009 +0100
+++ b/Python/ceval.c    Sun Jan 04 14:21:16 2009 -0500
@@ -1063,12 +1072,12 @@
     }
 
 fast_next_opcode:
-    f->f_lasti = INSTR_OFFSET();
 
     /* line-by-line tracing support */
 
     if (_Py_TracingPossible &&
         tstate->c_tracefunc != NULL && !tstate->tracing) {
+        f->f_lasti = INSTR_OFFSET();
         /* see maybe_call_line_trace
            for expository comments */
         f->f_stacktop = stack_pointer;

Added file: http://bugs.python.org/file12635/reenable-static-prediction.diff


[issue4753] Faster opcode dispatch on gcc

2009-01-07 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

@skip:
In simple words, the x86-64 call:
  call 0x2000
placed at address 0x1000 becomes, once encoded:
  call %rip + 0x1000
(i.e. the target is stored relative to the instruction pointer).

RIP holds the instruction pointer, which will be 0x1000 in this case
(actually, I'm ignoring the detail that when executing the call, RIP
points to the first byte of the next instruction).

If I execute the same instruction from a different location (i.e. a
different RIP), things will break. So only code for opcodes with no real
calls and no accesses to globals can be copied like this (inlined calls
are OK).
With refcounting, not even POP_TOP is safe, since it can call
destructors. DUP_TOP is still safe, I guess.


[issue4753] Faster opcode dispatch on gcc

2009-01-06 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

@pitrou:
<ranting mode>
Argh, reference counting hinders even that?
</ranting mode>
I just discovered another problem caused by refcounting.

Various techniques allow creating binary code from the interpreter
binary, by just pasting together the code for the common interpreter
cases and producing calls to the others. But, guess what, on most
platforms (except plain x86, but including x86_64, and maybe even x86 in
the shared-library case) this does not work if the copied code includes
function calls (on x86_64 that's due to RIP-relative addressing; on
other platforms, to similar issues).

So, Python could not even benefit from that! That's a real pity...
I'll have to try subroutine threading, to see whether it's faster than
indirect threading on current platforms or still slower.


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

@Alexandre:
>> So, can you try dropping the switch altogether, using always computed
>> goto and seeing how does the resulting code get compiled?

> Removing the switch won't be possible unless we change the semantics of
> EXTENDED_ARG. In addition, I doubt the improvement, if any, would be
> worth the increased complexity.

OK, it's time that I post code to experiment with that - there is no
need to break EXTENDED_ARG. And the point is to fight miscompilations.

> Do you actually mean the time spent interpreting bytecodes compared to
> the time spent in the other parts of Python? If so, your figures are
> wrong for CPython on x86-64. It is about 50% just like on x86 (when
> running pybench). With the patch, this drops to 35% on x86-64 and to
> 45% on x86.

More or less, I do mean that, but I was making an example, and I made up
reasonable figures. 70%, or even more, just for _dispatch_ (i.e. just
for the mispredicted indirect jump) is valid for real-world Smalltalk
interpreters, for instance, or for the ones in "The Structure and
Performance of Efficient Interpreters".
But when you say "interpreting opcodes", I do not know which part you
refer to: just the computed goto, or the whole code in the interpreter
function?


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

@Skip: if one decides to generate binary code, there is no need to use
switches. Inline threading (also known as "code copying" in some
research papers) is probably what you are looking for:

For references and background on threading techniques mentioned there, see:

http://en.wikipedia.org/wiki/Threaded_code
http://www.complang.tuwien.ac.at/forth/threaded-code.html


[issue4753] Faster opcode dispatch on gcc

2009-01-04 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

@alexandre: if you add two labels per opcode and two dispatch tables -
one before the parameter fetch (like now) and one after it (where we
have the 'case') - you can keep the same speed; see the sketch below.
And under the hood we also had two dispatch tables before, with the
switch, so it's not a big deal; moreover, the second table is only used
in the slow path (i.e. EXTENDED_ARG, or when additional checks are
needed).
About f->f_lasti: when I have time, I want to try optimizing it - in
some places you can be sure it's not going to be used. By the way, you
have some changes related to it in the abstract-switch patch; I think
that was not intended?


[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

1st note: is that code from the threaded version? Note that you need to
modify the source to make it accept ICC as well in order to try that.
In case you already did so, I guess the patch is not useful at all
with ICC, since, as far as I can see, the jump is shared. It is vital to
this patch that the jump is not shared; something similar to
-fno-crossjumping should be found.

2nd note: the answer to your question seems to be yes, ICC has fewer
register spills. Look for instance at:

   movl    -272(%ebp), %ecx
   movzbl  (%ecx), %eax
   addl    $1, %ecx

and

   movzbl  (%esi), %ecx
   incl    %esi

This represents the increment of the program counter after loading the
next opcode. In the code you posted, one can see that the program
counter is spilled to memory by GCC but isn't by ICC. Either the spill
is elsewhere, or ICC is simply better here. It's widely known that ICC
has a much better optimizer in many cases, and I remember that GCC's
register allocator really needs improvement.

Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole
optimizer should turn that into an incl; but I'm not shocked, just
because I've never seen perfect GCC output.


[issue4753] Faster opcode dispatch on gcc

2009-01-03 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

Daniel, I forgot to ask for the compilation command lines you used,
since they make a lot of difference. Can you post them? Thanks.


[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

About miscompilations: the current patch is a bit weird for GCC, because
you keep both the switch and the computed gotos.

But actually, there is no case in which the switch is needed, and
computed gotos leave less room to GCC's choices.

So, can you try dropping the switch altogether, always using computed
gotos, and seeing how the resulting code gets compiled? I see you'll
need two labels (before and after argument fetch) per opcode and two
dispatch tables, but that's no big deal (except for code alignment -
align just the common branch target).

An important warning: by default, on my system, GCC 4.2 aligns branch
targets for the switch to a 16-byte boundary (as recommended by the
Intel optimization guide) by adding a ".p2align 4,,7" GAS directive,
but it does not do that for computed gotos.

Adding the directive by hand gave a small speedup, 2% I think; I should
try -falign-jumps=16 if it's not already enabled (some -falign-jumps is
enabled by -O2), since that is supposed to give the same result.

Please try that yourself as well, and verify whether it works for
labels - I fear it doesn't.

> However, I don't know why the speed up due to the patch is much
> more significant on x86-64 than on x86.

It's Amdahl's law, even if this is not about parallel code. When the
rest is faster (x86_64), the same speedup on dispatch gives a bigger
overall speedup.

To be absolutely clear: x86_64 has more registers, so the rest of the
interpreter is faster than on x86, but dispatch still takes the same
absolute time - which amounts to 70% of the total on x86_64 but only 50%
on x86 (those are realistic figures). If this patch halved the dispatch
time on both (we're not so lucky), we would save 35% on x86_64 but only
25% on x86.
In fact, on inefficient interpreters, indirect threading is useless
altogether.

So, do those extra registers help _so_ much? Yes. In my toy interpreter,
computing the equivalent of f_lasti at each dispatch doesn't give any
big slowdown, but storing it in the frame object (like f->f_lasti) gives
a ~20% slowdown - I cross-checked multiple times because I was
astonished. Conversely, when the program counter had to be stored in
memory, I think it was something like 2x slower.


[issue4753] Faster opcode dispatch on gcc

2009-01-02 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

> I'm not an expert in this kind of optimizations. Could we gain more
> speed by making the dispatcher table more dense? Python has less than
> 128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a
> smaller table. I naively assume a smaller table increases the amount
> of cache hits.

Well, you have no binary-compatibility constraint with a new release, so
it can be tried and benchmarked - or it can just be done anyway!
On x86_64 the footprint of the jump table is 8 bytes per pointer * 256
pointers = 2KiB, and the L1 data cache of a Pentium4 can be 8KiB or
16KiB wide.
But I don't expect this to be noticeable in most synthetic
microbenchmarks. Matrix multiplication would be the perfect one, I
guess; the repeated column accesses would kill the L1 data cache, if the
whole matrices don't fit.


[issue4753] Faster opcode dispatch on gcc

2009-01-01 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

> We would have to change opcode.h for this to be truly useful (in
> order to re-use OPCODE_LIST()).

Yep.

> I think that should be the subject of a separate bug entry for code
> reorganization.

Agreed, I'll maybe try to find time for it.

> Thanks for all the explanation and pointers!

You're welcome - thanks to you for writing the patch!

> About register allocation, I wonder if the temporary variables u,v,w
> could be declared separately in each opcode rather than at the top of
> the eval function, so that the compiler doesn't try to store their
> values between two opcodes.

I didn't look at the output, but that shouldn't be needed with decent
optimizers: since they are local variables, the compiler has a full
picture of their usage (which is not the case for the contents of the
heap, where the frame object may lie).

I think that change would just save some compilation time for dataflow
analysis, maybe :-). Or it could make clearer which variable is used
where, but that is a matter of taste (what's there is fine with me).

I just studied liveness analysis in compilers: it computes whether a
variable is live before and after each statement. If the value of a
variable is not used in some piece of code until the next write to it,
the variable is considered dead in that piece of code, and it does not
take space there. Since u, v, w are either unused or written to before
use in every opcode, they are dead at the beginning of each opcode, so
they are also dead just before dispatch.


The only important thing is that the contents of the jump table are
known to the compiler and that the compiler makes use of that; see the
sketch below. Simply passing a non-const jump table to some function
defined elsewhere (which could in theory modify it) would make the
output much worse.
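Roughly what the threaded-code patch already does, as I understand it
(a sketch, not the patch's exact code):

    /* Inside the eval function, so every entry is visible to the
       compiler and nothing else can modify the table. */
    static void *dispatch_table[256] = {
        &&TARGET_STOP_CODE, &&TARGET_POP_TOP, /* ... */
    };
    ...
    goto *dispatch_table[opcode];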


[issue4753] Faster opcode dispatch on gcc

2009-01-01 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

> I attached some additional benchmarks on SunOS. So far, it seems the
> benefits of the proposed optimization are highly compiler-dependent.

Well, it would be more correct to say that, as you verified for GCC 3.4,
miscompilation of the code happens easily.

Any literature search shows that threading in a fast interpreter does
help. My experience shows two exceptions to this rule:
a) bad compiler output;
b) interpreters which are not efficient enough - when other operations
are even slower than instruction dispatch (which is really slow due to
costly mispredictions), threading can't help.

This is also shown by the sheer number of interpreters using threading.

Wikipedia has more pointers on this:
http://en.wikipedia.org/wiki/Threaded_code
Note that what I called "indirect threading" is called "token threading"
there.

Another example of the importance of threading is shown in this article:
http://webkit.org/blog/189/announcing-squirrelfish/

Some clues about why Python does not use threading:

http://mail.python.org/pipermail/python-list/1999-May/002003.html

It is important to note that the people in that thread were not aware of
why threading gives a speedup.

==
For SunCC, I can't say anything without looking at:
a) the generated code: if jump targets were aligned only for the switch
but not for computed gotos, for instance, that could maybe explain such
a slowdown; lots of other details might be relevant;
b) performance-counter results, especially regarding mispredictions of
indirect branches.


[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso

Changes by Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com:


--
nosy: +blaisorblade


[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

Mentioning other versions as well.
The patch is so easy that it can be backported to all supported
versions, so I'm adding all of them (2.5 is in bugfix-only mode, and as
far as I can see this patch cannot be accepted there, sadly).

--
versions: +Python 2.6, Python 2.7, Python 3.0


[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

Some other comments.
The time savings of indirect threading are also partly due to the
removal of the range check, but better branch prediction is the main
advantage.

> Also, the macro USE_THREADED_CODE should be renamed to something else;
> the word thread is too tightly associated with multi-threading.

That's true.

> Furthermore, threaded code simply refers to code consisting only of
> function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would
> be better.

I'd prefer USE_DIRECT_DISPATCH (or better, USE_THREADED_DISPATCH) over
USE_COMPUTED_GOTO, since the latter merely names the language construct
being used.

"Indirect threading" is the standard name in the CS literature for this
technique. "Direct threading" is a variation where, in the bytecode
array, each opcode is replaced by the pointer opcode_handler[opcode], so
that dispatch is slightly faster; both are sketched below. Most
interpreters use indirect threading to save memory, and because it makes
it possible to switch the opcode handler table, for instance to activate
debugging.
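A hedged sketch of the two flavours (GCC computed-goto syntax; the
variable names are made up):

    /* Indirect ("token") threading: the code array stores opcodes,
       and dispatch loads the handler address from a table. */
    unsigned char *pc = bytecode;
    goto *handlers[*pc++];

    /* Direct threading: the code array itself stores the handler
       addresses, saving the table load. */
    void **tpc = threaded_code;
    goto **tpc++;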

The best paper about this is "The Structure and Performance of Efficient
Interpreters", M. Anton Ertl and David Gregg, 2003.

The original paper about (direct) threaded code is "Threaded Code",
James R. Bell, Communications of the ACM, 1973, but I don't feel it is
relevant at all today. Indirect threading was introduced in "Indirect
Threaded Code", Robert B. K. Dewar, Communications of the ACM, 1975
(that's just a bit more relevant, but still).


[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

> You may want to check out issue1408710 in which a similar patch was
> provided, but failed to deliver the desired results.

It's not really similar, because you don't duplicate the dispatch code
there. It took me some time to understand why you didn't change the
"goto fast_next_opcode", but that's exactly where the speedup is missed.

The only difference made by that change is saving the range check for
the switch, so the slowdown probably comes from some minor change in
GCC's output, I guess.

Anyway, this suggests that the speedup really comes from better branch
prediction and not from saving the range check. The first paper I
mentioned states that saving the range check might make only a small
difference. The point is that sometimes, when you are going to flush
the pipeline anyway, adding a few extra instructions, even conditional
jumps, does not make a difference. I've observed this behaviour quite a
few times while building a small Python interpreter from scratch.

I guess (but this might be wrong) that's because the execution units
were not used to their fullest, and adding conditional jumps doesn't
make a difference because flushing a pipeline once or twice is almost
the same (the second flush removes just a few instructions). Or
something like that - I'm not expert enough in CPU architecture to be
sure of such guesses.


[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

Topics:
1) About the different speedups on 32 bits vs 64 bits
2) About the PPC slowdown
3) PyPI

=== About the different speedups on 32 bits vs 64 bits ===
An interpreter is very register-hungry, so on x86_64 it spends much less
time on register spills (i.e. moving locals from/to memory), and
instruction dispatch therefore takes a bigger share of the execution
time. If the rest of the interpreter is too slow, indirect threading
gives no advantage.

Look at the number of register variables in PyEval_EvalFrameEx() (by the
way, do you support any compiler where 'register' still makes a
difference nowadays? That would be quite weird). Lots of them are simply
cached copies of fields of the current frame and of the current function
(see the sketch below); without copying them to locals, the compiler
would have to assume they could change at any function call.
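Roughly what the top of PyEval_EvalFrameEx does (simplified from 2.x
ceval.c):

    register PyObject **stack_pointer;
    register unsigned char *next_instr;
    PyCodeObject *co = f->f_code;
    PyObject *names  = co->co_names;   /* cached copies: without them,  */
    PyObject *consts = co->co_consts;  /* each use would reload through */
    PyObject **fastlocals = f->f_localsplus;  /* the frame pointer      */
    unsigned char *first_instr =
        (unsigned char *) PyString_AS_STRING(co->co_code);
    next_instr = first_instr + f->f_lasti + 1;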

In fact, adding locals this way gave huge speedups on tight loops in the
Python interpreter I built with a colleague for our student project to
experiment with speeding up Python.

And adding a write to memory (to f->f_lasti) in the dispatch code gave a
20% slowdown. Since my interpreter uses a copying garbage collector
while CPython uses reference counting, which is much slower (if you
doubt this, show me a fast JVM with reference counting), I'm even
surprised you can get such a big speedup from threading.

=== About the PPC slowdown ===
Somebody should try the patch on a Pentium4 as well.

During our VM course, threading slowed down a toy interpreter with 4 toy
opcodes. Our teachers suggested that with such a small interpreter,
since threaded code takes more space (in that case, ~64 vs ~100 bytes),
this could create problems for the code caches, and they suggested
verifying that idea using performance counters. I'm not sure why that
would be.

Right now I have neither a Pentium4 nor a PowerPC available, so I can't
check myself. But that is the best way to analyze this unexpected
performance behaviour.

=== PyPI ===
Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch
Paolo> cannot be accepted there, sadly).

Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI...

I was thinking of a private backport as well. I didn't know about PyPI;
it looks like PyPI is more for contributed modules than for something
like this - would that work?


[issue4753] Faster opcode dispatch on gcc

2008-12-31 Thread Paolo 'Blaisorblade' Giarrusso

Paolo 'Blaisorblade' Giarrusso p.giarru...@gmail.com added the comment:

== On the patch itself ==
Why don't you use the C preprocessor instead of that Python code?
Sample code:

#define OPCODE_LIST(DEFINE_OPCODE) \
    DEFINE_OPCODE(STOP_CODE, 0)     \
    DEFINE_OPCODE(POP_TOP, 1)       \
    DEFINE_OPCODE(ROT_TWO, 2)       \
    DEFINE_OPCODE(ROT_THREE, 3)     \
    DEFINE_OPCODE(DUP_TOP, 4)       \
    DEFINE_OPCODE(ROT_FOUR, 5)      \
    DEFINE_OPCODE(NOP, 9)           \
    ...

/* Each entry expands to a labels-as-values initializer,
   e.g. [0] = &&label_STOP_CODE, */
#define DECL_OPCODE(opcode, number) \
    [number] = &&label_ ## opcode,

void *opcodes[] = {
    OPCODE_LIST(DECL_OPCODE)
};
#undef DECL_OPCODE

There are also other ways to do it, but using higher-order functions
within the preprocessor in this way is something that I learned from the
V8 source code.
It has the advantage that OPCODE_LIST can be used in a lot of other
places (maybe when implementing the 'opcode' module, if it's written in C).
