Hello,

"Walter Bright" <newshou...@digitalmars.com> wrote in message
news:ijnt3o$22dm$1...@digitalmars.com...
> nedbrek wrote:
>> Reordering happens in the scheduler. A simple model is "Fetch",
>> "Schedule", "Retire". Fetch and retire are done in program order. For
>> code that is hitting well in the cache, the biggest bottleneck is that
>> "4" decoder (the complex instruction decoder). Reducing the number of
>> complex instructions will be a big win here (and settling them into the
>> 4-1-1(-1) pattern).
>>
>> Of course, on anything after Core 2, the "1" decoders can handle pushes,
>> pops, and load-ops (r+=m) (although not load-op-store (m+=r)).
>>
>> Also, "macro op fusion" allows you to get a branch along with the last
>> instruction in decode, potentially giving you 5 macroinstructions per
>> cycle from decode. Make sure it is the flags producing instruction
>> (cmp-br).
>>
>
> I can't find any Intel documentation on this. Can you point me to some?

The best available source is the optimization reference manual
(http://www.intel.com/products/processor/manuals/). The latest version
is 248966.pdf, which mentions "Decodes up to four instructions, or up to
five with macro-fusion" (page 33). Also, page 36: "Macro-fusion merges
two instructions into a single μop. Intel Core microarchitecture is
capable of one macro-fusion per cycle in 32-bit operation". It's unclear
whether macro fusion is off entirely in 64-bit mode, and whether this has
changed in more recent processors...

They recommend against aligning code in general to 4-1-1-1 (also page
36), but I'd assume this is for a very targeted application. As always,
it is best to run things both ways and measure.

The next section (2.1.2.5) talks about stack pointer tracking, which
allows macro operations that used to be 2 uops (pop r -> load r = [esp];
inc esp) to become one (just the load). Pushes, which used to be 3 uops
(store_address esp, store_data r, dec esp), should also be one fused uop
(via sta/std fusion and stack pointer tracking).

----

Another good resource is "Real World Tech", particularly:
http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144

Page 4 covers the front end: "Macro-op fusion lets the decoders combine
two macro instructions into a single uop. Specifically, x86 compare or
test instructions are fused with x86 jumps to produce a single uop and
any decoder can perform this optimization."

----

Finally, the Intel Technology Journal has some really good details (when
you can find them! :) For example:
http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf
describes the first processor to use micro-op fusion (Pentium M, or
Banias, which was the base design for Dothan and Yonah). See page 26
(epage 7/18), which starts the section "MICRO-OPS FUSION". It gives a lot
of detail on the store address / store data fusion.

Hope that helps,
Ned
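
P.S. To make the 4-1-1-1 point concrete, here is a toy Python sketch of
the decode constraint. It is only a model under stated assumptions, not a
description of real hardware: decoder 0 takes any instruction of up to 4
uops, the three other decoders take only 1-uop instructions, decode is
strictly in program order, and macro fusion, fetch width and microcoded
instructions are all ignored. The name decode_cycles and the uop counts
in the usage lines are made up for illustration.

```python
def decode_cycles(uop_counts, simple_decoders=3):
    """Count decode cycles for a toy 4-1-1-1 front end.

    uop_counts: uops per instruction, in program order (each 1..4).
    Assumption: decoder 0 accepts anything up to 4 uops; the other
    decoders accept only 1-uop instructions. Macro fusion is ignored.
    """
    for n in uop_counts:
        if not 1 <= n <= 4:
            raise ValueError("toy model only handles 1..4 uops per instruction")

    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        i += 1  # decoder 0 takes the next instruction, simple or complex
        taken = 0
        # The simple decoders fill up only while the next instructions
        # are single-uop; a complex one must wait for decoder 0.
        while (i < len(uop_counts) and taken < simple_decoders
               and uop_counts[i] == 1):
            i += 1
            taken += 1
    return cycles


# Same six instructions, two orderings (hypothetical uop counts):
print(decode_cycles([1, 2, 1, 2, 1, 2]))  # complex ops land on simple slots: 4
print(decode_cycles([2, 1, 2, 1, 2, 1]))  # complex ops lead each group: 3
```

In this model, reordering alone saves a cycle, and cutting the number of
complex instructions saves more. Whether that carries over to a real
chip is exactly the "run it both ways and measure" question.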