https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153

--- Comment #4 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Yes, with VLS reduction this will improve.

On aarch64 + SVE the vectorizer reports
loop inside costs: 2
which is similar to our VLS costs.

And their loop is indeed short:

        ld1w    z30.s, p7/z, [x0, x2, lsl 2]
        add     x2, x2, x3
        add     z31.s, p7/m, z31.s, z30.s
        whilelo p7.s, w2, w1
        b.any   .L3

Not much to be squeezed out with a VLS approach.  I guess that's why.
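
For reference, a plain sum reduction of the following shape is the kind of
source loop that would plausibly vectorize to the SVE sequence above. This is
only a sketch: the function name `reduc_plus` and its signature are
assumptions for illustration, not copied from the PR's testcase.

```c
/* Hypothetical reproducer; name and signature are assumptions.  */
int
reduc_plus (const int *restrict a, int n)
{
  int r = 0;                    /* accumulator, kept in a vector register inside the loop */
  for (int i = 0; i < n; ++i)
    r += a[i];                  /* one contiguous load and one add per iteration */
  return r;                     /* the vector accumulator is reduced to a scalar after the loop */
}
```

Building something like this with -O3 -march=armv8-a+sve and
-fdump-tree-vect-details should show the cost model's "loop inside costs"
line in the vectorizer dump, though the exact figure may depend on the GCC
version and tuning options.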
