https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #6 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
I have tried generic-ooo:

https://compiler-explorer.com/z/44dcePczz

Some vectorized code still remains in the last few lines of the assembly:

        vsetivli        zero,4,e32,m1,ta,ma
        addw    a5,s10,t1
        addw    a0,a0,t3
        addw    a5,a5,t3
        vmv.v.x v2,a0
        ld      s6,56(sp)
        vmv.v.x v1,a5
        ld      s5,88(sp)
        addw    a3,s6,s4
        addw    a5,s5,a1
        vslide1down.vx  v2,v2,a3
        ld      s7,64(sp)
        vslide1down.vx  v1,v1,a5
        ld      s5,96(sp)
        addw    a3,s7,t4
        addw    a5,s5,a6
        vslide1down.vx  v2,v2,a3
        vslide1down.vx  v1,v1,a5
        ld      a5,72(sp)
        ld      a6,104(sp)
        addw    a3,a5,a7
        vslide1down.vx  v2,v2,a3
        addw    a5,a6,a4
        vslide1down.vx  v1,v1,a5
        ld      s1,120(sp)
        vse32.v v2,0(s1)
        addi    a5,s1,16
        vse32.v v1,0(a5)
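
For readers less familiar with RVV, here is a rough scalar model in C of what each
vmv.v.x + 3x vslide1down.vx chain above does: it assembles a 4-element vector of
32-bit values from four scalar addw results, which is then stored with vse32.v.
The type vec4 and the helper build_vec4 are hypothetical names used only for this
illustration; this is a sketch of the instruction semantics with VL=4, not GCC's
implementation.

```c
#include <assert.h>
#include <stdint.h>

#define VL 4 /* matches "vsetivli zero,4,e32,m1,ta,ma" */

/* Hypothetical stand-in for one vector register holding VL 32-bit elements. */
typedef struct { int32_t e[VL]; } vec4;

/* Model of vmv.v.x: broadcast the scalar x to every element. */
static vec4 vmv_v_x(int32_t x) {
    vec4 v;
    for (int i = 0; i < VL; i++)
        v.e[i] = x;
    return v;
}

/* Model of vslide1down.vx: move each element down by one index
   and place the scalar x in the last element (index VL-1). */
static vec4 vslide1down_vx(vec4 v, int32_t x) {
    vec4 r;
    for (int i = 0; i < VL - 1; i++)
        r.e[i] = v.e[i + 1];
    r.e[VL - 1] = x;
    return r;
}

/* Build {s0, s1, s2, s3} the same way the listing builds v1 and v2:
   one broadcast followed by three slides. */
static vec4 build_vec4(int32_t s0, int32_t s1, int32_t s2, int32_t s3) {
    vec4 v = vmv_v_x(s0);      /* {s0, s0, s0, s0} */
    v = vslide1down_vx(v, s1); /* {s0, s0, s0, s1} */
    v = vslide1down_vx(v, s2); /* {s0, s0, s1, s2} */
    v = vslide1down_vx(v, s3); /* {s0, s1, s2, s3} */
    assert(v.e[0] == s0 && v.e[VL - 1] == s3);
    return v;
}
```

Each slide depends on the result of the previous one, so every 4-element vector
costs a serial chain of four vector instructions fed by scalar registers, before
the store can issue.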

I suspect this will still hurt performance.

I can ask Li Pan to test it.
