https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122470
--- Comment #4 from Jeffrey A. Law <law at gcc dot gnu.org> ---
So what I find potentially more interesting here is the failure of the RTL
optimizers to simplify that store-load sequence. The problem is most likely
the sizes of the accesses:

(insn 7 4 10 2 (set (mem/j:QI (reg/v/f:DI 135 [ out ]) [1 out_4(D)->f_1+0 S1 A32])
        (const_int 0 [0])) "j.c":10:46 282 {*movqi_internal}
     (nil))
[ ... ]
(insn 12 11 13 2 (set (reg:DI 142)
        (sign_extend:DI (mem/j:SI (reg/v/f:DI 135 [ out ]) [1 out_4(D)->f_2+-1 S4 A32]))) "j.c":10:46 125 {*extendsidi2_internal}
     (nil))
(insn 13 12 14 2 (set (reg:SI 141)
        (subreg:SI (reg:DI 142) 0)) "j.c":10:46 276 {*movsi_internal}
     (nil))
(insn 14 13 15 2 (set (reg:DI 143)
        (and:DI (subreg:DI (reg:SI 141) 0)
            (const_int 255 [0xff]))) "j.c":10:46 104 {*anddi3}
     (nil))

It's not until after CSE2 that things clean up meaningfully. But nothing after
CSE2 is likely to clean this up. I'm not sure of the best path forward, but
it's clearly not a trivial problem.
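
To make the size mismatch concrete, here is a rough C sketch of what insns 7 and 12-14 compute. It is not the testcase from this PR: the function names are hypothetical, and it assumes a little-endian target so that the byte stored by insn 7 is the low byte of the word loaded by insn 12. Under that assumption the masked result is always the byte that was just stored, which is the fold one would hope for.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring the RTL above: a 1-byte store followed
   by a 4-byte load from the same address, masked to the low 8 bits.  */
uint64_t store_then_load(void *out)
{
    unsigned char *p = out;
    int32_t word;

    p[0] = 0;                       /* insn 7: QImode store of 0 */
    memcpy(&word, p, sizeof word);  /* insn 12: SImode load, same address */
    int64_t ext = word;             /* insn 12: sign_extend to DImode */
    return (uint64_t)ext & 0xff;    /* insn 14: and with 255 */
}

/* What the sequence could fold to, ASSUMING little-endian byte order:
   the low byte of the loaded word is the byte stored just before.  */
uint64_t store_then_load_folded(void *out)
{
    *(unsigned char *)out = 0;
    return 0;
}
```

Both functions leave the same bytes in memory and return the same value on a little-endian host. The store is 1 byte wide and the load is 4 bytes wide, so a pass that only forwards a store to a load of the same mode would not be expected to see through it.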