https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153
--- Comment #4 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Yes, with VLS reduction this will improve.

On aarch64 + sve I see

  loop inside costs: 2

This is similar to our VLS costs.

And their loop is indeed short:

        ld1w    z30.s, p7/z, [x0, x2, lsl 2]
        add     x2, x2, x3
        add     z31.s, p7/m, z31.s, z30.s
        whilelo p7.s, w2, w1
        b.any   .L3

Not much to be squeezed out with a VLS approach.  I guess that's why.
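For reference, the SVE loop above has the shape of a plain 32-bit integer sum reduction: ld1w loads a vector of words, the predicated add accumulates into z31, and whilelo recomputes the loop predicate from a 32-bit counter and bound. A minimal sketch of a scalar source that could give rise to such a loop follows. The function name reduc_plus and the exact types are assumptions for illustration, not the testcase from this PR.

```c
#include <stdint.h>

/* Hypothetical scalar source (an assumption, not the PR's testcase):
   a sum reduction over 32-bit elements with a 32-bit trip count.  */
int32_t
reduc_plus (const int32_t *a, int n)
{
  int32_t sum = 0;
  /* Each iteration adds one element into the accumulator; a vectorizer
     can turn this into a predicated vector add plus a final reduction.  */
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}
```

Whether a given compiler version emits exactly the five-instruction loop quoted above for this source depends on the options and cost model in use.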