https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
This is already fixed in current trunk; sorry, I forgot to check that before recommending reporting this store-coalescing bug.

# https://godbolt.org/z/j3MdWrcWM
# GCC nightly -O3 (tune=generic) and GCC11
store:
        movl    $16, %eax
        movw    %ax, ldap(%rip)
        ret

In case anyone's wondering why GCC doesn't use movw $16, foo(%rip): it's avoiding LCP (length-changing prefix) stalls on Intel P6-family CPUs from the 16-bit immediate.

For MOV specifically, that only happens on P6-family (Nehalem and earlier), not Sandybridge-family, so it's getting close to time to drop it from -mtune=generic. (-mtune=bdver* or znver* don't do it, so there is a tuning setting controlling it.)

GCC *only* seems to know about MOV, so ironically with -march=skylake for example, we avoid a nonexistent LCP stall for mov to memory, but GCC compiles x += 1234 into code that will LCP stall: addw $1234, x(%rip).

-march=alderlake disables this tuning workaround, using movw $imm, mem. (The Silvermont-family E-cores in Alder Lake don't have this problem either, so that's correct. Agner Fog's guide didn't mention any changes in LCP stalls for Alder Lake.)

Avoiding LCP stalls is somewhat less important on CPUs with a uop cache, since they only happen on legacy decode. Various things can still cause code to run only from legacy decode even inside a loop, though, such as Skylake's JCC erratum microcode mitigation if users don't assemble with the option to have GAS work around it, which GCC doesn't pass by default with -march=skylake.

If there isn't already a bug open about tuning choices mismatching hardware, I can repost this as a new bug if you'd like.

Related:
https://stackoverflow.com/questions/75154687/is-this-a-missed-optimization-in-gcc-loading-an-16-bit-integer-value-from-roda
and
https://stackoverflow.com/questions/70719114/why-does-the-short-16-bit-variable-mov-a-value-to-a-register-and-store-that-u
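
For anyone who wants to reproduce the two cases discussed above, here is a minimal C sketch. The names x, store_imm16 and add_imm16 are made up for illustration (this is not the source behind the Godbolt link), and the asm in the comments is only what this comment reports, so actual output may differ with GCC version and tuning options.

```c
/* Hypothetical test case; names are illustrative, not from the bug report. */
short x;

/* 16-bit store of an immediate.
 * Reported output with -O3 -mtune=generic:
 *     movl $16, %eax
 *     movw %ax, x(%rip)
 * which avoids a 16-bit immediate (and so an LCP stall on P6-family).
 * With -march=alderlake the comment reports movw $imm, mem instead. */
void store_imm16(void)
{
    x = 16;
}

/* 16-bit read-modify-write with an immediate that doesn't fit in 8 bits.
 * Reported output with -march=skylake:
 *     addw $1234, x(%rip)
 * which has a 16-bit immediate and can LCP stall on legacy decode. */
void add_imm16(void)
{
    x += 1234;
}
```

Compiling this with gcc -O3 -S under different -mtune= / -march= settings should show whether the MOV workaround is applied while the ADD case is left alone.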