nedbrek wrote:
Reordering happens in the scheduler. A simple model is "Fetch", "Schedule", "Retire". Fetch and retire are done in program order. For code that is hitting well in the cache, the biggest bottleneck is the "4" decoder (the complex instruction decoder). Reducing the number of complex instructions will be a big win here, as will arranging them into the 4-1-1(-1) pattern.

Of course, on anything after Core 2, the "1" decoders can handle pushes, pops, and load-ops (r+=m) (although not load-op-store (m+=r)).
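
To make the 4-1-1(-1) pattern concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not a description of the real hardware: one complex decoder takes the first instruction of each cycle (up to 4 uops), three simple decoders take 1-uop instructions only, and a multi-uop instruction that lands on a simple decoder has to wait for the next cycle. The function name, its parameters, and the uop counts are all made up for illustration.

```python
def decode_cycles(uop_counts, simple_decoders=3, complex_max_uops=4):
    """Count decode cycles for instructions given as per-instruction uop counts.

    Toy 4-1-1-1 model (an assumption, not measured behaviour):
    - the first instruction of each cycle goes to the complex decoder,
    - following instructions go to simple decoders only if they are 1 uop,
    - decode is strictly in program order.
    """
    for count in uop_counts:
        if not 1 <= count <= complex_max_uops:
            raise ValueError("this toy model only handles 1..%d uops" % complex_max_uops)

    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        i += 1  # complex decoder takes the next instruction, whatever its size
        used = 0
        # fill the simple decoders while the next instructions are single-uop
        while i < n and used < simple_decoders and uop_counts[i] == 1:
            i += 1
            used += 1
        cycles += 1
    return cycles


# Same eight instructions, two orderings (hypothetical uop counts).
in_pattern = [2, 1, 1, 1, 2, 1, 1, 1]      # complex instruction leads each group
out_of_pattern = [1, 2, 1, 2, 1, 2, 1, 2]  # complex instructions keep hitting simple slots

print(decode_cycles(in_pattern))       # 2
print(decode_cycles(out_of_pattern))   # 5
```

In this model the same instruction mix takes 2 cycles when it follows the pattern and 5 when it does not, which is the effect the ordering advice above is after. Macro-op fusion is not modelled.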

Also, "macro-op fusion" lets you get a branch along with the last instruction in decode, potentially giving you 5 macroinstructions per cycle from decode. Make sure the branch immediately follows the flags-producing instruction (cmp-br).

(I used to work for Intel :)

I can't find any Intel documentation on this. Can you point me to some?
