https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120687
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|RISC-V: very poor vector    |RISC-V: unoptimal vector
                   |code gen for LMbench bw_mem |code gen for LMbench bw_mem
                   |test case                   |test case
           Assignee|rguenth at gcc dot gnu.org  |unassigned at gcc dot
                   |                            |gnu.org
                 CC|                            |rguenth at gcc dot gnu.org
             Status|ASSIGNED                    |NEW
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is now fixed as far as I intend to work on it for now. The code generated
for the 16-load case is now
.L6:
        sub     a5,a5,a2
        vsetvli zero,a5,e32,m1,ta,ma
        vle32.v v8,0(t4)
        vsetvli zero,a6,e32,m1,ta,ma
        vle32.v v7,0(a0)
        vsetvli zero,a1,e32,m1,ta,ma
        vle32.v v6,0(t5)
        vsetvli zero,a2,e32,m1,ta,ma
        vle32.v v5,0(t3)
        vsetvli zero,a5,e32,m1,tu,ma
        vadd.vv v2,v8,v2
        vsetvli zero,a6,e32,m1,tu,ma
        vadd.vv v4,v7,v4
        vsetvli zero,a1,e32,m1,tu,ma
        vadd.vv v1,v6,v1
        vsetvli zero,a2,e32,m1,tu,ma
        vadd.vv v3,v5,v3
        sub     a4,a4,t1
        add     t4,t4,a7
        add     a0,a0,a7
        add     t5,t5,a7
        add     t3,t3,a7
        bgtu    t6,t1,.L7
which is not optimal yet. The reason is the grouped load of size 16, which
is larger than the lower bound of the poly-nunits of the RVV vector type, so
a single vector is not guaranteed to cover the group.
This causes us to limit the LEN to load the group.
For optimal code generation we'd need to re-roll the loop, that is, support a
fractional vectorization factor (VF). A VF of 1/4 would be optimal here, but
a VF of 1/16 should work equally well.
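To illustrate what re-rolling means at the source level, here is a minimal
sketch in C. It is not the bw_mem test case from this PR: the function names
and the kernel shape (a sum over 16 adjacent 32-bit elements per iteration)
are assumptions made for illustration only. The first function has a load
group of size 16; the second is the same reduction re-rolled by hand so that
each iteration covers a quarter of the original one, which is roughly what a
VF of 1/4 would let the vectorizer do on its own. Unsigned elements are used
so that reassociating the sum cannot change the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the "16 load" kernel: every scalar iteration
   reads a group of 16 adjacent 32-bit elements.  */
uint32_t
sum_group16 (const uint32_t *p, size_t ngroups)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < ngroups; i++, p += 16)
    sum += p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7]
           + p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15];
  return sum;
}

/* The same reduction re-rolled by hand: four times as many iterations,
   each reading a group of only 4 elements (a quarter of the original
   iteration).  */
uint32_t
sum_group4 (const uint32_t *p, size_t ngroups)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < ngroups * 4; i++, p += 4)
    sum += p[0] + p[1] + p[2] + p[3];
  return sum;
}

/* Self-check: both forms must agree because unsigned addition wraps
   modulo 2^32 and is therefore freely reassociable.  */
void
check_rerolled_matches (void)
{
  uint32_t buf[64];
  for (size_t i = 0; i < 64; i++)
    buf[i] = (uint32_t) (i * 2654435761u);
  assert (sum_group16 (buf, 4) == sum_group4 (buf, 4));
}
```

With a group of 4 elements, a single e32/m1 vector would cover the whole
group whenever it holds at least 4 elements, so the per-group LEN limit
described above would presumably no longer be needed.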
An alternative to the above code-gen is to find an element type that can
be used with a struct-load. With TImode elements a ld4 would be possible,
and a single one would be guaranteed to cover the whole group.
But in the end a fractional VF is going to be the optimal solution.