Richard,

Could you provide us with a good reference for the latencies and other
speed issues of SSE operations?  What I've found is scattered and hard
to compare.

Frankly, I was under the misconception that each of these SSE operations
was meant to complete in a single clock cycle (although I knew there
were various other issues).

Cheers!

On 23.01.10, Richard Guenther wrote:
> On Sat, Jan 23, 2010 at 6:33 PM, Steve White <swh...@aip.de> wrote:
> > Hi, Andrew!
> >
        ...
> >
> > Never mind icc for the moment, with whatever trick it may be doing.
> > Why is the SSE2 division so slow, compared to multiplication?
> >
> > Change one character in the division test to make a multiplication test.
> > It is an order of magnitude difference in speed.
> 
> It's because multiplication latency is about 4 cycles while division is about
> 20; also, one multiplication can be issued per cycle while only every
> 17th instruction can be a division (AMD Fam10 values).
> 
> GCC performs loop interchange with -ftree-loop-linear but the pass
> is scheduled in an unfortunate place so no further optimization happens.
> 
> Richard.
> 
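To put rough numbers on the latency and issue-rate figures quoted above, here
is a minimal sketch of a cycle model in C. It assumes a single pipelined unit,
n independent operations, and the approximate AMD Fam10 values from the quote;
"every 17th instruction" is read here as one division issued per 17 cycles,
which is an interpretation. The name estimated_cycles and the constants are
hypothetical and for illustration only; this is a back-of-the-envelope
estimate, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate AMD Fam10 figures from the quoted reply; illustrative only. */
enum {
    MUL_LATENCY  = 4,   /* cycles until a multiply result is available */
    MUL_INTERVAL = 1,   /* cycles between successive multiply issues */
    DIV_LATENCY  = 20,  /* cycles until a divide result is available */
    DIV_INTERVAL = 17   /* assumption: one divide can issue every 17 cycles */
};

/* Estimated cycles for n independent operations on one pipelined unit:
 * the last operation issues at cycle (n - 1) * interval and completes
 * `latency` cycles after that. */
static uint64_t estimated_cycles(uint64_t n, uint64_t latency, uint64_t interval)
{
    assert(n > 0);
    return (n - 1) * interval + latency;
}
```

Under these assumptions, estimated_cycles(1000, MUL_LATENCY, MUL_INTERVAL)
gives 1003 cycles and estimated_cycles(1000, DIV_LATENCY, DIV_INTERVAL) gives
17003, a ratio of roughly 17, which is in line with the order-of-magnitude
difference reported in the thread.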

-- 
| -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
| Steve White                                             +49(331)7499-202
| e-Science / AstroGrid-D                                   Zi. 35  Bg. 20
| -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
| Astrophysikalisches Institut Potsdam (AIP)
| An der Sternwarte 16, D-14482 Potsdam
|
| Vorstand: Prof. Dr. Matthias Steinmetz, Peter A. Stolz
|
| Stiftung privaten Rechts, Stiftungsverzeichnis Brandenburg: III/7-71-026
| -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
