https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #8 from Peter Cordes <peter at cordes dot ca> ---
Created attachment 45544 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm

(In reply to H.J. Lu from comment #7)
> I fixed assembly codes and run it on different AVX machines.
> I got similar results:
>
> ./test
> sse      : 28346518
> sse_clear: 28046302
> avx      : 28214775
> avx2     : 28251195
> avx_clear: 28092687
>
> avx_clear:
> 	vxorps    %xmm0, %xmm0, %xmm0
> 	vcvtsd2ss %xmm1, %xmm0, %xmm0
> 	ret
>
> is slightly faster.

I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect where adding any extra uop makes a difference. Or it's just luck of code alignment for the uop cache (32-byte, or maybe 64-byte, boundaries).

You're still testing with the caller compiled without optimization. The loop is a mess of sign-extension and reloads, of course, but most importantly, keeping the loop counter in memory creates a loop-carried dependency chain involving store-forwarding latency.

On Intel Sandybridge-family, attempting a load later can make it succeed more quickly in store-forwarding cases, so perhaps an extra xor-zeroing uop is reducing the average latency of the store/reloads for the loop counter (which is probably the real bottleneck).
https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization

Loads are weird in general: the scheduler anticipates their latency, and dispatches the uops that will consume their results in the cycle when it expects the load to put its result on the forwarding network. But if the load *isn't* ready when expected, the uops that wanted that input may have to be replayed.

See https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop for a detailed analysis of this effect on IvyBridge. (Skylake doesn't have the same restrictions on stores next to loads, but other effects can still cause replays.)
https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810 is an interesting case for pointer-chasing, where the load port speculates that it can use the base pointer for the TLB lookup, instead of base+offset.

https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming shows load replays on cache misses.

So there's a huge number of complicating factors from using a calling loop that keeps its loop counter in memory, because SnB-family doesn't have a simple fixed latency for store forwarding.

----

If I put the tests in a different order, I sometimes get results like:

./test
sse      : 26882815
sse_clear: 26207589
avx_clear: 25968108
avx      : 25920897
avx2     : 25956683

Often avx (with the false dep on the load result into XMM1) is slower than avx_clear or avx2, but there's a ton of noise.

----

Adding  vxorps %xmm2, %xmm2, %xmm2  to avx.S also seems to have sped it up; now it's the same speed as the others, even though I'm *not* breaking the dependency chain anymore: XMM2 is unrelated, and nothing touches it.

This basically proves that your benchmark is sensitive to extra instructions, whether they interact with vcvtsd2ss or not. We know that in the general case, throwing in extra NOPs or xor-zeroing instructions on unused registers does not make code faster, so we should definitely distrust the results of this microbenchmark.

I've attached my NASM loop. It has various commented-out loop bodies, and notes in comments on results I found with performance counters. I don't know if it will be useful (it's a bit messy), but it's what I use for testing snippets of asm in a static binary with near-zero startup overhead. I just run perf stat on the whole executable and look at cycles / uops.