Re: tooling quality and some random rant

nedbrek Sat, 19 Feb 2011 12:20:56 -0800

Hello,

"Walter Bright" <newshou...@digitalmars.com> wrote in message 
news:ijnt3o$22dm$1...@digitalmars.com...
> nedbrek wrote:
>> Reordering happens in the scheduler. A simple model is "Fetch", 
>> "Schedule", "Retire".  Fetch and retire are done in program order.  For 
>> code that is hitting well in the cache, the biggest bottleneck is that 
>> "4" decoder (the complex instruction decoder).  Reducing the number of 
>> complex instructions will be a big win here (and settling them into the 
>> 4-1-1(-1) pattern).
>>
>> Of course, on anything after Core 2, the "1" decoders can handle pushes, 
>> pops, and load-ops (r+=m) (although not load-op-store (m+=r)).
>>
>> Also, "macro op fusion" allows you can get a branch along with the last 
>> instruction in decode, potentially giving you 5 macroinstructions per 
>> cycle from decode.  Make sure it is the flags producing instruction 
>> (cmp-br).
>>
>
> I can't find any Intel documentation on this. Can you point me to some?

The best available source is the optimization reference manual
(http://www.intel.com/products/processor/manuals/). The latest version is
248966.pdf, which mentions "Decodes up to four instructions, or up to five
with macro-fusion" (page 33). Also, page 36: "Macro-fusion merges two
instructions into a single µop. Intel Core microarchitecture is capable of
one macro-fusion per cycle in 32-bit operation". It's unclear whether
macro-fusion is off entirely in 64-bit mode, and whether this has changed in
more recent processors...

They recommend against aligning code in general to 4-1-1-1 (also page 36),
but I'd assume this is for a very targeted application. As always, it is
best to run things both ways and measure.
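
To make the 4-1-1-1 idea concrete, here is a toy model in Python. It is only
a sketch under my own assumptions, not how the hardware is documented to
work: decoder 0 takes any instruction of up to 4 uops, the other three
decoders take only 1-uop instructions, decode is strictly in program order,
and macro-fusion, queues and microcoded instructions are ignored. The
function name and parameters are made up for the illustration.

```python
def decode_cycles(uop_counts, simple_slots=3, complex_limit=4):
    """Count cycles a toy 4-1-1-1 decoder needs for an instruction stream.

    uop_counts holds the number of uops each instruction decodes to, in
    program order. This is a simplified model, not a description of any
    real processor.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        if uop_counts[i] > complex_limit:
            raise ValueError("instruction too complex for this toy model")
        cycles += 1
        i += 1  # decoder 0 takes the next instruction, simple or complex
        taken = 0
        # The remaining decoders only accept 1-uop instructions, in order.
        while i < n and taken < simple_slots and uop_counts[i] == 1:
            i += 1
            taken += 1
    return cycles


# Same eight instructions, two orderings.
print(decode_cycles([2, 1, 1, 1, 2, 1, 1, 1]))  # complex ops spaced out
print(decode_cycles([2, 2, 1, 1, 1, 1, 1, 1]))  # complex ops back to back
```

In this model the spaced-out ordering decodes in 2 cycles and the
back-to-back ordering in 3. Real hardware has much more going on, so this
only shows why ordering can matter; measurement still decides.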

The next section (2.1.2.5) talks about stack pointer tracking, which allows
macro operations that used to be 2 uops (pop r -> load r = [esp]; inc esp)
to become one (just the load). Pushes, which used to be 3 uops
(store_address esp, store_data r, dec esp), should also be one fused uop (via
sta/std fusion and stack pointer tracking).
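
As a quick worked example of what those counts add up to, the Python sketch
below tallies uops for a run of pushes and pops. The per-instruction numbers
are the ones given above; the "before"/"after" labels, the table and the
function name are just scaffolding for the illustration.

```python
# uop counts per instruction, as stated in the text above.
# "before" = without stack pointer tracking and sta/std fusion,
# "after"  = with them. The labels are assumptions of this sketch.
UOPS = {
    "before": {"push": 3, "pop": 2},
    "after": {"push": 1, "pop": 1},
}


def stack_uops(instructions, era):
    """Sum the uops for a list of 'push'/'pop' mnemonics."""
    table = UOPS[era]
    return sum(table[ins] for ins in instructions)


# A function that saves and restores three registers.
frame = ["push"] * 3 + ["pop"] * 3
print(stack_uops(frame, "before"), stack_uops(frame, "after"))  # 15 6
```

So saving and restoring three registers goes from 15 uops to 6 under these
numbers.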

----
Another good resource is "Real World Tech", particularly:
http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144

Page 4 covers the front end: "Macro-op fusion lets the decoders combine two
macro instructions into a single uop. Specifically, x86 compare or test
instructions are fused with x86 jumps to produce a single uop and any
decoder can perform this optimization."
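
A minimal Python sketch of that pairing rule, for illustration only: it
walks a list of mnemonics and merges a compare or test with a conditional
jump that follows it immediately. The set of jump mnemonics is my own
illustrative assumption (which jumps actually fuse depends on the
microarchitecture), and the sketch ignores the one-fusion-per-cycle limit
and the 32-bit restriction quoted earlier.

```python
FLAG_PRODUCERS = {"cmp", "test"}
# Assumption: a small illustrative sample of conditional jumps,
# not a list of what any particular processor fuses.
COND_JUMPS = {"je", "jne", "jz", "jnz", "ja", "jae", "jb", "jbe"}


def fuse_macro_ops(mnemonics):
    """Merge each cmp/test with a directly following conditional jump."""
    fused = []
    i = 0
    while i < len(mnemonics):
        cur = mnemonics[i]
        nxt = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        if cur in FLAG_PRODUCERS and nxt in COND_JUMPS:
            fused.append(cur + "+" + nxt)  # one slot for the pair
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


print(fuse_macro_ops(["mov", "add", "cmp", "jne"]))  # pair is adjacent
print(fuse_macro_ops(["cmp", "mov", "jne"]))         # pair is split
```

The first call yields three entries (the cmp and jne share one), the second
stays at three separate entries because the mov sits between the compare
and the branch. That is the practical point of "make sure it is the flags
producing instruction": keep the cmp/test right next to its branch.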

----
Finally, the Intel Technology Journal has some really good details (when you
can find them! :)

For example:
http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf

details the first processor to use micro-op fusion (Pentium M or Banias -
which was the base design for Dothan and Yonah). See page 26 (epage 7/18) -
which starts the section "MICRO-OPS FUSION". It gives a lot of detail on
the store address / store data fusion.

Hope that helps,
Ned
