https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/

After adding a `ret` to each AVX function, all 5 are basically the same speed (compiling the C with `-O2` or `-O2 -march=native`), with just noise making it hard to see anything clearly. sse_clear tends to be faster than sse in a group of runs, but if there are differences it's more likely due to weird front-end effects and all the loads of inputs + store/reload of the return address by call/ret.

I did

    while ./test; do :; done

to factor out CPU clock-speed ramp up and maybe some cache warmup stuff, but it's still noisy from run to run. Making printf/write system calls between tests will cause TLB / branch-prediction effects because of kernel spectre mitigation, so I guess every test is in the same boat, running right after a system call.

Adding loads and stores into the mix makes microbenchmarking a lot harder.

Also notice that since the `xmm0` and `xmm1` pointers are global, those pointers are reloaded every time through the loop even with optimization. I guess you're not trying to minimize the amount of work outside of the asm functions, to measure them as part of a messy loop. So for the versions that have a false dependency, you're making that dependency on the result of this:

    mov     rax,QWORD PTR [rip+0x2ebd]       # reload xmm1
    vmovapd xmm1,XMMWORD PTR [rax+rbx*1]     # index xmm1

Anyway, I think there's too much noise in the data, and lots of reasons to expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than VPXOR+convert, except in cases where adding an extra uop actually helps, or where code-alignment effects matter.
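
To illustrate the point about the global pointers being reloaded inside the loop, here is a minimal C sketch, not the actual test program from this bug. The names `g_src`, `g_dst`, `convert_one`, `run_with_globals` and `run_with_locals` are hypothetical, and `convert_one` is a plain-C stand-in for one of the hand-written asm functions. When the callee is opaque to the compiler (as a separately assembled routine is), it has to assume the call may modify the globals, so it reloads them every iteration. Copying them into locals once before the loop removes that assumption.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024

/* Hypothetical stand-ins for the test program's global pointers. */
static double src_buf[N];
static float  dst_buf[N];
double *g_src = src_buf;
float  *g_dst = dst_buf;

/* Stand-in for one asm function under test (assumption: the real one is
   separately assembled, so the compiler cannot see what it modifies). */
void convert_one(const double *in, float *out)
{
    *out = (float)*in;
}

/* Uses the globals directly. With an opaque callee, the compiler must
   assume the call could change g_src/g_dst and reload them each iteration. */
void run_with_globals(size_t n)
{
    assert(n <= N);
    for (size_t i = 0; i < n; i++)
        convert_one(&g_src[i], &g_dst[i]);
}

/* Copies the globals into locals once. Locals whose address never escapes
   cannot be changed by the call, so they can stay in registers. */
void run_with_locals(size_t n)
{
    assert(n <= N);
    const double *src = g_src;
    float *dst = g_dst;
    for (size_t i = 0; i < n; i++)
        convert_one(&src[i], &dst[i]);
}
```

Both loops compute the same results. The difference only shows up in the generated asm (one fewer load feeding the address calculation per iteration), which is worth checking with `objdump -d` rather than taking on faith.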