https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/

After adding a `ret` to each AVX function, all 5 are basically the same speed
(compiling the C with `-O2` or -O2 -march=native), with just noise making it
hard to see anything clearly.  sse_clear tends to be faster than sse in a group
of runs, but if there are differences it's more likely due to weird front-end
effects and all the loads of inputs + store/reload of the return address by
call/ret.

I did  while ./test;  : ;done   to factor out CPU clock-speed ramp up and maybe
some cache warmup stuff, but it's still noisy from run to run.  Making
printf/write system calls between tests will cause TLB / branch-prediction
effects because of kernel spectre mitigation, so I guess every test is in the
same boat, running right after a system call.

Adding loads and stores into the mix makes microbenchmarking a lot harder.

Also notice that since `xmm0` and `xmm1` pointers are global, those pointers
are reloaded every time through the loop even with optimization.  I guess
you're not trying to minimize the amount of work outside of the asm functions,
to measure them as part of a messy loop.  So for the version that have a false
dependency, you're making that dependency on the result of this:

    mov    rax,QWORD PTR [rip+0x2ebd]      # reload xmm1
    vmovapd xmm1,XMMWORD PTR [rax+rbx*1]   # index xmm1

Anyway, I think there's too much noise in the data, and lots of reason to
expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than
VPXOR+convert, except in cases where adding an extra uop actually helps, or
where code-alignment effects matter.

Reply via email to