[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

hjl.tools at gmail dot com Mon, 28 Jan 2019 13:51:24 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071


--- Comment #4 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Peter Cordes from comment #2)
> (In reply to H.J. Lu from comment #1)
> > But
> > 
> >     vxorps  %xmm0, %xmm0, %xmm0
> >     vcvtsd2ss       %xmm1, %xmm0, %xmm0
> > 
> > are faster than both.
> 
> On Skylake-client (i7-6700k), I can't reproduce this result in a
> hand-written asm loop.  (I was using NASM to make a static executable that
> runs a 100M iteration loop so I could measure with perf).  Can you show some
> asm where this performs better?

Please try the cvtsd2ss branch at:

https://github.com/hjl-tools/microbenchmark/

On an Intel Core i7-6700K, I got:

[hjl@gnu-skl-2 microbenchmark]$ make
gcc -g -I.    -c -o test.o test.c
gcc -g   -c -o sse.o sse.S
gcc -g   -c -o sse-clear.o sse-clear.S
gcc -g   -c -o avx.o avx.S
gcc -g   -c -o avx2.o avx2.S
gcc -g   -c -o avx-clear.o avx-clear.S
gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
./test
sse      : 24533145
sse_clear: 24286462
avx      : 64117779
avx2     : 62186716
avx_clear: 58684727
[hjl@gnu-skl-2 microbenchmark]$
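
For readers who want to experiment from C rather than assembly, the
zero-then-convert sequence quoted above can be approximated with SSE2
intrinsics.  This is only a sketch: the helper names cvt_sd_to_ss_zeroed
and self_check are hypothetical and are not part of the microbenchmark
repository, and whether the compiler actually emits vxorps followed by
vcvtsd2ss depends on the compiler version and flags (e.g. -mavx), so
check the generated asm before drawing conclusions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_set_sd, _mm_cvtsd_ss, ... */

/* Hypothetical helper (not from the microbenchmark repository).
   Converts a double to float through a freshly zeroed destination
   register, mirroring the "xor-zero, then cvtsd2ss" pattern. */
float cvt_sd_to_ss_zeroed(double x)
{
    __m128d src = _mm_set_sd(x);    /* x in the low lane, upper lane zeroed */
    __m128 dst = _mm_setzero_ps();  /* zeroed: no dependency on an older value */

    /* Low lane becomes (float)x; the upper three lanes are copied from dst. */
    dst = _mm_cvtsd_ss(dst, src);

    return _mm_cvtss_f32(dst);      /* extract the low float lane */
}

/* Small sanity check on values that are exactly representable as float. */
void self_check(void)
{
    assert(cvt_sd_to_ss_zeroed(1.5) == 1.5f);
    assert(cvt_sd_to_ss_zeroed(-0.25) == -0.25f);
}
```

Passing a zeroed vector as the first operand is what makes the merge
into the destination harmless; passing an arbitrary live register there
instead would be the C-level analogue of the false dependency under
discussion.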

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
