https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057
Andrew Waterman <andrew at sifive dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |andrew at sifive dot com
--- Comment #7 from Andrew Waterman <andrew at sifive dot com> ---
It is a more advanced optimization, but these known-constant-stride cases can
sometimes be more efficiently vectorized using masked unit-stride loads and
stores. (Implementations I've worked on execute the masked variants of these
instructions only slightly less efficiently than the unmasked ones.) For
example:
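    vsetivli x0, 25, e32, m8, ta, ma   # request vl = 25, 32-bit elements, LMUL = 8
    li t0, 0x1111111                   # mask bits 0, 4, 8, ..., 24: every 4th element
    vmv.s.x v0, t0                     # put the mask in the low bits of v0
loop:
    vle32.v v8, (a5), v0.t             # masked unit-stride load
    vse32.v v8, (a4), v0.t             # masked unit-stride store
    addi a5, a5, 512                   # advance the source pointer by 512 bytes
    addi a4, a4, 512                   # advance the destination pointer by 512 bytes
    bgeu a1, a5, loop                  # repeat while a5 <= a1 (unsigned)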
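
To illustrate why a masked unit-stride access can stand in for a constant-stride one, here is a minimal scalar sketch in C. It is not taken from the bug report or from GCC; the helper names masked_unit_stride_copy and strided_copy are hypothetical, and the code only models which elements one vle32.v/vse32.v pair touches under the 0x1111111 mask, not the pointer updates or the hardware's performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one masked unit-stride load/store pair:
 * element i is copied only if bit i of `mask` is set.
 * Hypothetical helper; vl is limited to 32 because the mask is 32 bits wide. */
static void masked_unit_stride_copy(uint32_t *dst, const uint32_t *src,
                                    size_t vl, uint32_t mask)
{
    assert(vl <= 32);
    for (size_t i = 0; i < vl; i++) {
        if ((mask >> i) & 1u)
            dst[i] = src[i];
    }
}

/* Reference: copy every `stride`-th element among the first `vl` elements. */
static void strided_copy(uint32_t *dst, const uint32_t *src,
                         size_t vl, size_t stride)
{
    assert(stride > 0);
    for (size_t i = 0; i < vl; i += stride)
        dst[i] = src[i];
}
```

With vl = 25 and mask 0x1111111, masked_unit_stride_copy touches elements 0, 4, 8, ..., 24, which is the same set that strided_copy touches with a stride of 4 elements (16 bytes).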