https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #4 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Peter Cordes from comment #2)
> (In reply to H.J. Lu from comment #1)
> > But
> >
> >         vxorps          %xmm0, %xmm0, %xmm0
> >         vcvtsd2ss       %xmm1, %xmm0, %xmm0
> >
> > are faster than both.
>
> On Skylake-client (i7-6700k), I can't reproduce this result in a
> hand-written asm loop.  (I was using NASM to make a static executable that
> runs a 100M iteration loop so I could measure with perf).  Can you show some
> asm where this performs better?

Please try cvtsd2ss branch at:

https://github.com/hjl-tools/microbenchmark/

On Intel Core i7-6700K, I got

[hjl@gnu-skl-2 microbenchmark]$ make
gcc -g -I. -c -o test.o test.c
gcc -g -c -o sse.o sse.S
gcc -g -c -o sse-clear.o sse-clear.S
gcc -g -c -o avx.o avx.S
gcc -g -c -o avx2.o avx2.S
gcc -g -c -o avx-clear.o avx-clear.S
gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
./test
sse      : 24533145
sse_clear: 24286462
avx      : 64117779
avx2     : 62186716
avx_clear: 58684727
[hjl@gnu-skl-2 microbenchmark]$