On 29.10.2011 3:15, Manu wrote:

    This is instruction dispatch. The catch is that branch prediction
    operates on a per-branch basis, so a single switch-jump based VM
    dispatch will mispredict jumps most of the time; I've seen claims of
    up to 99% on average.
    If you place a whole switch at the end of every instruction's code,
    I'm not sure the compiler will find its way out of this mess, or
    even optimize it. I tried that with DMD; the problem is it has no
    idea how to use the *same* jump *table* for all of the identical
    switches.
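
(For concreteness, a minimal sketch of that "switch at the end of every
handler" layout - the three-opcode instruction set and the string-mixin
trick are made up purely for illustration; each mixin expands to an
identical switch, which is where the shared-jump-table question comes in.)

enum Op : ubyte { inc, dec, halt }

int runReplicated(const(ubyte)[] code)   // assumes well-formed bytecode ending in Op.halt
{
    int acc = 0;
    size_t pc = 0;

    // the dispatch switch, pasted at the end of every handler, so each
    // handler finishes with its own table load + indirect jump instead of
    // jumping back to a single shared dispatch point
    enum dispatch = q{
        final switch (cast(Op) code[pc++])
        {
            case Op.inc:  goto Linc;
            case Op.dec:  goto Ldec;
            case Op.halt: goto Lhalt;
        }
    };

    mixin(dispatch);    // initial dispatch

Linc:
    ++acc;              // handler body
    mixin(dispatch);    // replicated dispatch; per the observation above,
                        // DMD emits a separate jump table for each copy

Ldec:
    --acc;
    mixin(dispatch);

Lhalt:
    return acc;
}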


I don't quite follow what you mean here... so forgive me if I'm way off,
or if you're actually agreeing with me :)
Yes, this piece of code is effectively identical to the switch that may
sit at the top of an execute loop (assuming the switch compiles to a
jump table), but there are some subtle differences...
As said before, there's no outer loop, which saves some code and a
branch,

It doesn't save code - you replace an unconditional jump to the start of the switch with an extra load from the table + an indirect jump per VM opcode.
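
(For comparison, a sketch of the conventional single-dispatch-point loop
being contrasted here - same toy instruction set, invented purely for
illustration: every handler ends with a break followed by an unconditional
jump back to the one shared indirect jump.)

enum Op : ubyte { inc, dec, halt }   // same toy opcodes as above

int runLoop(const(ubyte)[] code)     // assumes well-formed bytecode ending in Op.halt
{
    int acc = 0;
    size_t pc = 0;
    for (;;)
    {
        // single shared dispatch point: one table load + indirect jump,
        // reached from every handler via an unconditional jump (the break)
        final switch (cast(Op) code[pc++])
        {
            case Op.inc:  ++acc; break;
            case Op.dec:  --acc; break;
            case Op.halt: return acc;
        }
    }
}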

but more importantly, on architectures that have branch target
registers, the compiler can schedule the load of the branch target
register earlier (while the rest of the 'opcode' is being executed),
which gives the processor more time to preload the instruction stream
from the branch destination and reduces the impact of the branch
misprediction.
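
(At source level, that scheduling idea roughly amounts to fetching the next
opcode before running the handler body, so the dispatch target is known as
early as possible; whether the compiler and CPU actually exploit this is
target-dependent, and the toy opcodes are again made up.)

enum Op : ubyte { inc, dec, halt }   // same toy opcodes as above

int runEarlyFetch(const(ubyte)[] code)   // assumes well-formed bytecode ending in Op.halt
{
    int acc = 0;
    size_t pc = 0;
    Op next = cast(Op) code[pc++];   // first opcode

    final switch (next)
    {
        case Op.inc:  goto Linc;
        case Op.dec:  goto Ldec;
        case Op.halt: goto Lhalt;
    }

Linc:
    next = cast(Op) code[pc++];      // fetch the next dispatch target early...
    ++acc;                           // ...so it can overlap with the handler body
    final switch (next)
    {
        case Op.inc:  goto Linc;
        case Op.dec:  goto Ldec;
        case Op.halt: goto Lhalt;
    }

Ldec:
    next = cast(Op) code[pc++];
    --acc;
    final switch (next)
    {
        case Op.inc:  goto Linc;
        case Op.dec:  goto Ldec;
        case Op.halt: goto Lhalt;
    }

Lhalt:
    return acc;
}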

I was thinking of x86 here, but there are other processors that do not have a branch target register. And depending on the pipeline length, the prediction should be done earlier than the load of the destination address.


I'm not sure what you're trying to say about the branch prediction, but
this isn't a 'branch' (nor is a switch that produces a jump table), it's
a jump, and it will never be predicted since it doesn't have a binary target.

I'm not sure about the terminology, but branch === jump, and both can be conditional. By switch-jump I mean a tabulated indirect jump, that is, the target is loaded from a table. It's an indirect jump and as such can go wherever it wants to. And yes, it's still predicted, mostly on the basis of "it will go to the same address as last time". Actually, the branch prediction mechanisms I've heard of don't even know whether a given jump is conditional (that would complicate them); the end result is just a block of "branch/jump at address X is predicted to go to address Y".

Many architectures attempt to alleviate this sort of unknown-target
penalty by introducing a branch target register, which, once loaded, will
immediately start fetching opcodes from the target... the key for the
processor is to load the target address into that register as early as
possible to hide the instruction fetch latency. Code written in the
style I illustrate will give the best possible outcome to that end.

Nice to know - which ones, by the way?

--
Dmitry Olshansky
