https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #8 from Peter Cordes <peter at cordes dot ca> ---
Created attachment 45544
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm

(In reply to H.J. Lu from comment #7)
> I fixed assembly codes and run it on different AVX machines.
> I got similar results:
> 
> ./test
> sse      : 28346518
> sse_clear: 28046302
> avx      : 28214775
> avx2     : 28251195
> avx_clear: 28092687
> 
> avx_clear:
>       vxorps  %xmm0, %xmm0, %xmm0
>       vcvtsd2ss       %xmm1, %xmm0, %xmm0
>       ret
> 
> is slightly faster.


I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect
where adding any extra uop makes a difference.  Or just luck of code alignment
for the uop cache (32-byte, or maybe 64-byte, boundaries).

You're still testing with the caller compiled without optimization.  The loop
is a mess of sign-extension and reloads, of course, but most importantly
keeping the loop counter in memory creates a dependency chain involving
store-forwarding latency.
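
For reference, the loop-carried dependency in the -O0 caller looks roughly like
this (a sketch of typical gcc -O0 output; the register and stack offset are
illustrative, not copied from your binary):

        addl    $1, -4(%rbp)         # counter kept in memory: load + add + store
        cmpl    $99999999, -4(%rbp)  # reload again for the compare
        jle     .L3

Each iteration's reload has to wait for the previous iteration's store to
forward, so the critical path runs through the store buffer instead of through
a register.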

On Intel Sandybridge-family, attempting a load later can make it succeed more
quickly in store-forwarding cases, so perhaps an extra xor-zeroing uop is
reducing the average latency of the store/reload of the loop counter (which is
probably the real bottleneck).

https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization

Loads are weird in general: the scheduler anticipates their latency and
dispatches uops that will consume their results in the cycle when it expects
the load to put the result on the forwarding network.  But if the load *isn't*
ready when expected, it may have to replay the uops that wanted that input.
See
https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop
for a detailed analysis of this effect on IvyBridge.  (Skylake doesn't have the
same restrictions on stores next to loads, but other effects can cause
replays.)
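
The pattern there is basically pointer-chasing with a dependent store next to
the load; a minimal sketch (hypothetical registers, not taken from that
question's exact code):

.loop:
        movq    (%rdi), %rdi        # load-use latency chain; scheduler assumes an L1d hit
        movq    %rax, 8(%rdi)       # nearby store that depends on the load result
        decl    %ecx
        jnz     .loop               # on IvB the store can make later loads replay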

https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810
is an interesting pointer-chasing case where the load port speculates that it
can use just the base register for the TLB lookup, instead of base+offset.
https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming
shows load replays on cache misses.
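
i.e. for a load like this (illustrative, not from either test):

        movq    16(%rdi), %rdi      # the port can start the TLB lookup with %rdi alone;
                                    # if %rdi+16 is in a different page than %rdi, the
                                    # load replays with the full address, costing extra cycles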

So there are a huge number of complicating factors in using a calling loop that
keeps its loop counter in memory, because SnB-family doesn't have a simple
fixed latency for store-forwarding.


----


If I put the tests in a different order, I sometimes get results like:

./test
sse      : 26882815
sse_clear: 26207589
avx_clear: 25968108
avx      : 25920897
avx2     : 25956683

Often avx (with the false dep on the load result in XMM1) is slower than
avx_clear or avx2, but there's a ton of noise.

----

Adding  vxorps  %xmm2, %xmm2, %xmm2  to avx.S also seems to have sped it up;
now it's the same speed as the others, even though I'm *not* breaking the
dependency chain anymore.  XMM2 is unrelated; nothing else touches it.
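
i.e. presumably something like this for the modified avx.S (a sketch; I'm
guessing the conversion operands from the avx_clear body quoted above):

avx:
        vxorps    %xmm2, %xmm2, %xmm2     # added: zeroing an unrelated register
        vcvtsd2ss %xmm1, %xmm0, %xmm0     # unchanged: still merges into old XMM0,
        ret                               # so the dep chain is intact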

This basically proves that your benchmark is sensitive to extra instructions,
whether they interact with vcvtsd2ss or not.


We know that in the general case, throwing in extra NOPs or xor-zeroing
instructions on unused registers does not make code faster, so we should
definitely distrust the result of this microbenchmark.


I've attached my NASM loop.  It has various commented-out loop bodies, and
notes in comments on results I found with performance counters.  I don't know
if it will be useful (because it's a bit messy), but it's what I use for
testing snippets of asm in a static binary with near-zero startup overhead.  I
just run perf stat on the whole executable and look at cycles / uops.
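
For anyone who wants to try it, the usual build/run recipe for that kind of
standalone NASM loop is something like this (event names are the SnB-family
spellings; adjust for your CPU):

        nasm -felf64 testloop-cvtss2sd.asm
        ld -o testloop testloop-cvtss2sd.o
        perf stat -e task-clock,cycles,instructions,uops_issued.any,uops_executed.thread ./testloop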
