https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585
--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #11)
> With the original test case, -mcpu=power8 is problematic because of the use
> of the "swapping stores," whose RHS is a vec_select rather than a register
> or subreg.  This prevents us from saving the RHS of the store for use in
> replacing subsequent loads, running afoul of this logic in
> dse.c:record_store ():
>
>   if (GET_CODE (body) == SET
>       /* No place to keep the value after ra.  */
>       && !reload_completed
>       && (REG_P (SET_SRC (body))           <= this part
>           || GET_CODE (SET_SRC (body)) == SUBREG
>           || CONSTANT_P (SET_SRC (body)))
>       && !MEM_VOLATILE_P (mem)
>       /* Sometimes the store and reload is used for truncation and
>          rounding.  */
>       && !(FLOAT_MODE_P (GET_MODE (mem)) && (flag_float_store)))
>
> We can circumvent this if we can use stvx to force the parameters to the
> stack, which is legal since the stack slots are properly aligned.
>
> However, even using -mcpu=power9, we don't handle removing the stores and
> replacing the partial loads with register logic.

You mean stores like the following?

(insn 13 12 14 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 112 [0x70])) [1 a+48 S16 A128])
        (vec_select:V4SI (reg:V4SI 190)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))

I wonder why dse can't simply force the rhs to a register?  Of course, if
power really has stores that do this vec_select but no non-store instruction
with the same operation, then this might not be valid ...

Now, in the end this example just shows that lowering register passing only
at RTL expansion leads to a load of missed optimizations regarding
parameter setup ... some scheme to apply the lowering on GIMPLE already
would be interesting to explore (albeit quite a bit of work).  We'd have
a second set of "parameter decls" somewhere, like in struct function, and
use that when the IL is in lowered form.  Same for DECL_RESULT of course.
And then the interesting part is whether to expose the stack in some way or
restrict the lowering to decomposition/combining to registers.
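
To make the "force the rhs to a register" idea concrete, here is a toy model in plain C.  It is only a sketch and uses no GCC internals: the type v4si and the functions vec_select_2301, swapping_store, plain_load, via_memory and via_register are hypothetical names made up for illustration.  It assumes the permutation in the store above ([2 3 0 1]) can also be computed on its own, outside a store, which is exactly the condition questioned above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a V4SI value: four 32-bit lanes. */
typedef struct { int32_t lane[4]; } v4si;

/* Models (vec_select:V4SI x (parallel [2 3 0 1])) as a standalone
   operation.  Assumption: such a non-store form is available. */
static v4si vec_select_2301(v4si x)
{
    v4si r = { { x.lane[2], x.lane[3], x.lane[0], x.lane[1] } };
    return r;
}

/* Models the "swapping store": memory receives the permuted value. */
static void swapping_store(void *mem, v4si x)
{
    v4si t = vec_select_2301(x);
    memcpy(mem, &t, sizeof t);
}

/* Models an ordinary full-width load from the same slot. */
static v4si plain_load(const void *mem)
{
    v4si r;
    memcpy(&r, mem, sizeof r);
    return r;
}

/* Baseline: the value round-trips through a stack slot. */
static v4si via_memory(v4si x)
{
    unsigned char slot[sizeof(v4si)];
    swapping_store(slot, x);
    return plain_load(slot);
}

/* Forwarded: the store's RHS is first computed into a temporary
   (the "register"), and the later load is replaced by that
   temporary, so the slot is never touched. */
static v4si via_register(v4si x)
{
    v4si tmp = vec_select_2301(x);
    return tmp;
}
```

In this model via_memory and via_register return the same lanes for any input, which is the property a forwarding transformation would need to preserve.  Whether the temporary can actually be materialized on a given target is a separate question that the sketch does not answer.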