[Bug tree-optimization/120687] RISC-V: unoptimal vector code gen for LMbench bw_mem test case

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 30 Oct 2025 01:20:00 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120687


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|RISC-V: very poor vector    |RISC-V: unoptimal vector
                   |code gen for LMbench bw_mem |code gen for LMbench bw_mem
                   |test case                   |test case
           Assignee|rguenth at gcc dot gnu.org  |unassigned at gcc dot gnu.org
                 CC|                            |rguenth at gcc dot gnu.org
             Status|ASSIGNED                    |NEW

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is now fixed as far as my current work on it goes.  The code generated
for the 16-load case is now

.L6:
        sub     a5,a5,a2
        vsetvli zero,a5,e32,m1,ta,ma
        vle32.v v8,0(t4)
        vsetvli zero,a6,e32,m1,ta,ma
        vle32.v v7,0(a0)
        vsetvli zero,a1,e32,m1,ta,ma
        vle32.v v6,0(t5)
        vsetvli zero,a2,e32,m1,ta,ma
        vle32.v v5,0(t3)
        vsetvli zero,a5,e32,m1,tu,ma
        vadd.vv v2,v8,v2
        vsetvli zero,a6,e32,m1,tu,ma
        vadd.vv v4,v7,v4
        vsetvli zero,a1,e32,m1,tu,ma
        vadd.vv v1,v6,v1
        vsetvli zero,a2,e32,m1,tu,ma
        vadd.vv v3,v5,v3
        sub     a4,a4,t1
        add     t4,t4,a7
        add     a0,a0,a7
        add     t5,t5,a7
        add     t3,t3,a7
        bgtu    t6,t1,.L7

which is not optimal yet.  The reason is the grouped load of size 16, which
is larger than the lower bound of the poly-nunits of the RVV vector type.
This causes us to limit the LEN to load the group.

For optimal code generation we'd need to re-roll, that is, support a
fractional vectorization factor: a VF of 1/4 would be optimal here, but a VF
of 1/16 should work equally well.
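
To illustrate what re-rolling means at the source level, here is a minimal
sketch.  The kernel shape is an assumption (the actual test case is not
quoted in this comment), and the function names are hypothetical.  It shows
a sum reduction with a group of 16 adjacent 32-bit loads per scalar
iteration, and the same reduction re-rolled by 4 and by 16.  Unsigned
addition is associative, so all three variants return the same value; how
each variant maps to a particular VF inside the vectorizer is not modelled
here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed kernel shape: one scalar iteration loads a group of 16 adjacent
   elements and reduces them into a single sum.  n must be a multiple of 16. */
uint32_t sum_grouped(const uint32_t *p, size_t n)
{
    assert(n % 16 == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += 16)
        sum += p[i]      + p[i + 1]  + p[i + 2]  + p[i + 3]
             + p[i + 4]  + p[i + 5]  + p[i + 6]  + p[i + 7]
             + p[i + 8]  + p[i + 9]  + p[i + 10] + p[i + 11]
             + p[i + 12] + p[i + 13] + p[i + 14] + p[i + 15];
    return sum;
}

/* Re-rolled by 4: each iteration covers a quarter of the original group,
   so the group size drops from 16 to 4. */
uint32_t sum_rerolled4(const uint32_t *p, size_t n)
{
    assert(n % 16 == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += 4)
        sum += p[i] + p[i + 1] + p[i + 2] + p[i + 3];
    return sum;
}

/* Re-rolled by 16: one element per iteration, no load group left. */
uint32_t sum_rerolled16(const uint32_t *p, size_t n)
{
    assert(n % 16 == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```

The point of the sketch is only that the re-rolled loops are semantically
equivalent to the grouped one, so a vectorizer that could perform this
transform internally would no longer need to cover a 16-element group with
a fixed number of length-limited loads.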

An alternative to the above code-gen is to find an element type that can
be used with a struct-load.  With TImode elements an ld4 would be possible,
and a single one would be guaranteed to cover the whole group.

But in the end a fractional VF is going to be the optimal solution.
