https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
This is already fixed in current trunk; sorry, I forgot to check that before
recommending that you report this store-coalescing bug.

# https://godbolt.org/z/j3MdWrcWM
# GCC nightly -O3   (tune=generic)  and GCC11
store:
        movl    $16, %eax
        movw    %ax, ldap(%rip)
        ret

In case anyone's wondering why GCC doesn't use  movw $16, foo(%rip) :
it's avoiding LCP (length-changing prefix) stalls on Intel P6-family CPUs
from the 16-bit immediate.
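
For anyone who wants to reproduce the asm above locally, here is a minimal
sketch.  The variable name ldap comes from the compiler output quoted above;
its type (uint16_t) and the exact body of store() are my assumptions, not
copied from the Godbolt link.

```c
#include <stdint.h>

/* Assumed declaration: any 16-bit global gives a 16-bit store. */
uint16_t ldap;

void store(void)
{
    /* With -O3 -mtune=generic this is expected to compile to
     *     movl $16, %eax
     *     movw %ax, ldap(%rip)
     * rather than a single movw $16, ldap(%rip), matching the asm above. */
    ldap = 16;
}
```

Compile with  gcc -O3 -S  and compare against -mtune=znver* or
-march=alderlake to see the single-instruction form.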

For MOV specifically, that only happens on P6-family (Nehalem and earlier), not
Sandybridge-family, so it's getting close to time to drop it from
-mtune=generic.  (-mtune=bdver* or -mtune=znver* don't do it, so there is a
tuning setting controlling it.)

GCC *only* seems to know about MOV, so ironically with -march=skylake for
example, we avoid a non-existent LCP stall for mov to memory, but GCC compiles
x += 1234 into code that will LCP stall:  addw $1234, x(%rip).
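
A small sketch of the two cases side by side, to make the mismatch easy to
check.  The function names are hypothetical; x is assumed to be a 16-bit
global.  The constant has to be outside the signed 8-bit range, because
addw $16, x(%rip) would use the sign-extended imm8 encoding and have no
16-bit immediate at all.

```c
#include <stdint.h>

uint16_t x;   /* assumed 16-bit global */

/* 16-bit store of an immediate: GCC's tuning splits this into
 * movl $1234, %eax ; movw %ax, x(%rip) where the workaround is enabled. */
void store_imm(void)
{
    x = 1234;
}

/* 16-bit read-modify-write with an immediate that needs imm16:
 * per the comment above, this becomes addw $1234, x(%rip),
 * which has an operand-size prefix plus a 16-bit immediate. */
void add_imm(void)
{
    x += 1234;
}
```

Comparing  gcc -O3 -march=skylake -S  output for the two functions should
show the workaround applied to the first and not the second.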

-march=alderlake disables this tuning workaround, using movw $imm, mem.  (The
Silvermont-family E-cores in Alder Lake don't have this problem either, so
that's correct.  Agner Fog's guide didn't mention any changes in LCP stalls for
Alder Lake.)

Avoiding LCP stalls is somewhat less important on CPUs with a uop cache, since
they only happen in legacy decode.  Still, various things can force code to
run from legacy decode even inside a loop, such as the microcode mitigation for
Skylake's JCC erratum when users don't assemble with the option that makes GAS
work around it, which GCC doesn't pass by default with -march=skylake.
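
As a sketch of what opting in looks like: the GAS option is
-mbranches-within-32B-boundaries, passed through the driver with -Wa.  The
file name and the loop below are hypothetical, just something with a
loop branch for the assembler to pad.

```c
/* Build sketch (loop.c is a made-up file name):
 *   gcc -O3 -march=skylake -Wa,-mbranches-within-32B-boundaries -c loop.c
 * Without the -Wa option, a jcc that crosses or ends at a 32-byte boundary
 * can keep the loop out of the uop cache on affected Skylake-family CPUs. */
#include <stddef.h>
#include <stdint.h>

uint32_t sum16(const uint16_t *p, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];   /* loop branch is what the assembler may pad around */
    return total;
}
```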

If there isn't already a bug open about tuning choices mismatching hardware, I
can repost this as a new bug if you'd like.


Related:
https://stackoverflow.com/questions/75154687/is-this-a-missed-optimization-in-gcc-loading-an-16-bit-integer-value-from-roda

and
https://stackoverflow.com/questions/70719114/why-does-the-short-16-bit-variable-mov-a-value-to-a-register-and-store-that-u
