https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c                           |target
   Last reconfirmed|                            |2021-08-30
             Target|                            |arm
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
One common source of missed optimizations is gimple_fold_builtin_memory_op,
which has

      /* If we can perform the copy efficiently with first doing all loads
         and then all stores inline it that way.  Currently efficiently
         means that we can load all the memory into a single integer
         register which is what MOVE_MAX gives us.  */
      src_align = get_pointer_alignment (src);
      dest_align = get_pointer_alignment (dest);
      if (tree_fits_uhwi_p (len)
          && compare_tree_int (len, MOVE_MAX) <= 0
...
          /* If the destination pointer is not aligned we must be able
             to emit an unaligned store.  */
          && (dest_align >= GET_MODE_ALIGNMENT (mode)
              || !targetm.slow_unaligned_access (mode, dest_align)
              || (optab_handler (movmisalign_optab, mode)
                  != CODE_FOR_nothing)))

where here the MOVE_MAX limit (which is 4 on this target) likely applies.
Since we actually do need to perform two loads, the code seems to do what is
intended (but that is of course "bad" for 64-bit copies on 32-bit archs and
likewise for 128-bit copies on 64-bit archs).

By the time memcpy is expanded at the RTL level it is usually too late to
fully elide the stack storage.
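For illustration only (this may not be the PR's exact testcase, and the
function name load64 is made up): a minimal sketch of the kind of copy the
fold above declines on a 32-bit arm target, because len == 8 exceeds
MOVE_MAX == 4, so the memcpy is not folded into a single load/store and can
end up going through a stack temporary.

    #include <string.h>
    #include <stdint.h>

    /* Hypothetical example: an 8-byte copy on 32-bit arm.  The size is
       larger than MOVE_MAX (4), so gimple_fold_builtin_memory_op leaves
       the memcpy call alone instead of inlining it as loads + stores.  */
    uint64_t
    load64 (const void *p)
    {
      uint64_t v;
      memcpy (&v, p, sizeof v);
      return v;
    }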