https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247
--- Comment #6 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
I have tried generic-ooo:

https://compiler-explorer.com/z/44dcePczz

There is still some vectorized code in the last few lines of the assembly:

        vsetivli        zero,4,e32,m1,ta,ma
        addw    a5,s10,t1
        addw    a0,a0,t3
        addw    a5,a5,t3
        vmv.v.x v2,a0
        ld      s6,56(sp)
        vmv.v.x v1,a5
        ld      s5,88(sp)
        addw    a3,s6,s4
        addw    a5,s5,a1
        vslide1down.vx  v2,v2,a3
        ld      s7,64(sp)
        vslide1down.vx  v1,v1,a5
        ld      s5,96(sp)
        addw    a3,s7,t4
        addw    a5,s5,a6
        vslide1down.vx  v2,v2,a3
        vslide1down.vx  v1,v1,a5
        ld      a5,72(sp)
        ld      a6,104(sp)
        addw    a3,a5,a7
        vslide1down.vx  v2,v2,a3
        addw    a5,a6,a4
        vslide1down.vx  v1,v1,a5
        ld      s1,120(sp)
        vse32.v v2,0(s1)
        addi    a5,s1,16
        vse32.v v1,0(a5)

I suspect this will still hurt performance.
I can ask Li Pan to test it.
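
For readers less familiar with the idiom, here is a rough scalar model in C of what I read the vmv.v.x + vslide1down.vx + vse32.v sequence as doing for each of v1 and v2: it assembles four 32-bit scalars into one vector register and stores them with a single vector store. The helper names (splat, slide1down, build_and_store) are made up for illustration and are not GCC or intrinsic names. The model ignores tail and mask policy and only covers the fixed VL=4, e32 case from the listing.

```c
#include <stdint.h>
#include <string.h>

/* Matches "vsetivli zero,4,e32,m1,ta,ma": 4 elements of 32 bits each. */
#define VL 4

/* Model of vmv.v.x: broadcast scalar x into every element. */
static void splat(int32_t v[VL], int32_t x)
{
    for (int i = 0; i < VL; i++)
        v[i] = x;
}

/* Model of vslide1down.vx: move each element down one slot
   and place scalar x in the last slot. */
static void slide1down(int32_t v[VL], int32_t x)
{
    for (int i = 0; i + 1 < VL; i++)
        v[i] = v[i + 1];
    v[VL - 1] = x;
}

/* One splat plus three slides yields {e0, e1, e2, e3};
   the final memcpy stands in for vse32.v. */
static void build_and_store(int32_t *dst,
                            int32_t e0, int32_t e1,
                            int32_t e2, int32_t e3)
{
    int32_t v[VL];
    splat(v, e0);
    slide1down(v, e1);
    slide1down(v, e2);
    slide1down(v, e3);
    memcpy(dst, v, sizeof v);
}
```

Under this model the result is the same as four scalar 32-bit stores, so each vector in the listing costs one splat, three dependent slides and a store to write four already-computed scalars.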