Richard,

Could you provide us with a good reference for the latencies and other speed issues of SSE operations? What I've found is scattered and hard to compare.
Frankly, I was under the misconception that each of these SSE operations was meant to be accomplished in a single clock cycle (although I knew there were various other issues).

Cheers!

On 23.01.10, Richard Guenther wrote:
> On Sat, Jan 23, 2010 at 6:33 PM, Steve White <swh...@aip.de> wrote:
> > Hi, Andrew!
> > ...
> >
> > Never mind icc for the moment, with whatever trick it may be doing.
> > Why is the SSE2 division so slow, compared to multiplication?
> >
> > Change one character in the division test to make a multiplication test.
> > It is an order of magnitude difference in speed.
>
> It's because multiplication latency is like 4 cycles while division is about
> 20, also one multiplication can be issued per cycle while only every
> 17th instruction can be a division (AMD Fam10 values).
>
> GCC performs loop interchange with -ftree-loop-linear but the pass
> is scheduled in an unfortunate place so no further optimization happens.
>
> Richard.
>

--
| - - - - - - - - - - - - - - - - - - - - - - - - -
| Steve White                             +49(331)7499-202
| e-Science / AstroGrid-D                 Zi. 35  Bg. 20
| - - - - - - - - - - - - - - - - - - - - - - - - -
| Astrophysikalisches Institut Potsdam (AIP)
| An der Sternwarte 16, D-14482 Potsdam
|
| Vorstand: Prof. Dr. Matthias Steinmetz, Peter A. Stolz
|
| Stiftung privaten Rechts, Stiftungsverzeichnis Brandenburg: III/7-71-026
| - - - - - - - - - - - - - - - - - - - - - - - - -
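
For anyone following the thread, the figures Richard quotes (roughly 4 cycles latency and one issue per cycle for multiplication, about 20 cycles latency and one issue every 17 cycles for division, AMD Fam10) can be turned into a back-of-the-envelope estimate. The sketch below is not a measurement. It assumes a simplified pipeline model in which all operations are independent, the first result arrives after the latency, and each later result follows one issue interval after the previous one. The function name est_cycles is made up for illustration.

```c
#include <stdio.h>

/* Rough cycle estimate for n independent operations on one pipelined unit.
   Assumed model (not from any vendor manual): the first result is ready
   after `latency` cycles, and each further operation can only be issued
   `interval` cycles after the previous one. */
unsigned long est_cycles(unsigned long n, unsigned latency, unsigned interval)
{
    if (n == 0)
        return 0;
    return (unsigned long)latency + (n - 1) * (unsigned long)interval;
}

/* Print the estimate for multiplication and division using the
   approximate AMD Fam10 values quoted in the thread. */
void print_estimate(unsigned long n)
{
    unsigned long mul = est_cycles(n, 4, 1);    /* ~4 cycle latency, 1 per cycle  */
    unsigned long div = est_cycles(n, 20, 17);  /* ~20 cycle latency, 1 per 17    */
    printf("n=%lu  mul=%lu cycles  div=%lu cycles\n", n, mul, div);
}
```

Under this model, 1000 independent multiplications take about 1003 cycles and 1000 independent divisions about 17003 cycles. For long loops the issue interval dominates rather than the latency, which is consistent with the order-of-magnitude difference seen in the test.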