https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117733
Bug ID: 117733
Summary: RISC-V SPEC2017 503.bwaves Inefficient Fortran multi-dimensional array access
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: vineetg at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, rdapp at gcc dot gnu.org
Target Milestone: ---
bwaves/cam4 have a lot of Fortran multi-dimensional array accesses, with nested
loops to traverse them, and the vector codegen for these doesn't look pretty or
efficient.
I have a reduced test:
      subroutine shell(q,nx)
      implicit none
      integer nx,ny,nzl
      real(kind=8) q(5,nx)
      real(kind=8) dqnorm
      integer l,i

      dqnorm = 0.0d0
      do i=1,nx
         do l=1,5
            dqnorm = dqnorm + q(l,i)*q(l,i)
         enddo
      enddo
      call use_val(dqnorm)
      return
      end
-Ofast -ftree-vectorize -march=rv64gcv_zvl256b_zba_zbb_zbs
-mrvv-vector-bits=zvl -mabi=lp64d
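For reference, here is a small Python model of what the reduced test computes. It is only a sketch: the names sum_sq_nested and sum_sq_flat are made up for illustration, and q is modelled as a flat list in Fortran's column-major order, so q(l,i) sits at offset (i-1)*5 + (l-1). It shows that the nested loop walks 5*nx consecutive doubles in order, i.e. it is a plain sum of squares over one contiguous run.

```python
def sum_sq_nested(q_flat, nx):
    """Mirror of the Fortran loop nest over q(5,nx), on a flat column-major list."""
    total = 0.0
    for i in range(1, nx + 1):        # do i=1,nx
        for l in range(1, 6):         # do l=1,5
            # Column-major: q(l,i) is at flat offset (i-1)*5 + (l-1).
            v = q_flat[(i - 1) * 5 + (l - 1)]
            total += v * v
    return total


def sum_sq_flat(q_flat, nx):
    """Same reduction written as a single pass over 5*nx contiguous elements."""
    total = 0.0
    for v in q_flat[:5 * nx]:
        total += v * v
    return total
```

Both functions visit the elements in the same order, so they return the same value for any input.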
The relevant output is not efficient:
vsetivli zero,4,e64,m1,ta,ma
vmv.v.i v2,0
sh2add a4,a4,a4
addi t5,a0,32
vmv1r.v v3,v2
vmv1r.v v4,v2
vmv1r.v v1,v2
vmv1r.v v5,v2
addi t4,a0,64
addi t3,a0,96
addi t1,a0,128
li t6,20
li a3,4
.L3:
minu a5,a4,t6
minu a7,a5,a3
sub a5,a5,a7
minu a6,a5,a3
sub a5,a5,a6
minu a1,a5,a3
sub a5,a5,a1
vsetvli zero,a7,e64,m1,ta,ma
vle64.v v10,0(a0)
minu a2,a5,a3
vsetvli zero,a6,e64,m1,ta,ma
vle64.v v9,0(t5)
sub a5,a5,a2
vsetvli zero,a1,e64,m1,ta,ma
vle64.v v8,0(t4)
vsetvli zero,a5,e64,m1,ta,ma
vle64.v v7,0(t1)
vsetvli zero,a2,e64,m1,ta,ma
vle64.v v6,0(t3)
vsetvli zero,a7,e64,m1,tu,ma
vfmacc.vv v5,v10,v10
vsetvli zero,a6,e64,m1,tu,ma
vfmacc.vv v1,v9,v9
vsetvli zero,a1,e64,m1,tu,ma
vfmacc.vv v4,v8,v8
vsetvli zero,a5,e64,m1,tu,ma
vfmacc.vv v2,v7,v7
mv t0,a4
vsetvli zero,a2,e64,m1,tu,ma
vfmacc.vv v3,v6,v6
addi a0,a0,160
addi t5,t5,160
addi t4,t4,160
addi t1,t1,160
addi t3,t3,160
addi a4,a4,-20
bgtu t0,t6,.L3
...
(1) There is a separate VLE64 per element fetch / unrolled loop entry, even
though Fortran is column-major and the elements accessed in the inner loop are
consecutive in memory.
(2) VL is used for predication, so at run time some of these ops will execute
with VL=0, which might be costly on some uarches.
(3) All the loads are issued first, followed by all the mac ops, instead of
batching the ops that use the same VL together.
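To make (2) concrete, here is a rough Python model of the minu/sub chain at the top of .L3. It is a sketch under assumptions: the function name vl_schedule is made up, it assumes a4 holds nx before the sh2add (the excerpt does not show this), and it assumes nx >= 1. For each loop iteration it returns the five VLs in memory order (offsets 0, 32, 64, 96, 128, i.e. a7, a6, a1, a2, a5).

```python
def vl_schedule(nx, vlmax=4, chunk=20):
    """Model the per-iteration VL values computed by the minu/sub chain in .L3."""
    remaining = 5 * nx                # sh2add a4,a4,a4 (assumes a4 == nx before)
    schedule = []
    while True:
        left = min(remaining, chunk)  # minu a5,a4,t6
        vls = []
        for _ in range(4):            # a7, a6, a1, a2
            vl = min(left, vlmax)     # minu aX,a5,a3
            left -= vl                # sub  a5,a5,aX
            vls.append(vl)
        vls.append(left)              # whatever is left in a5 is the fifth VL
        schedule.append(vls)
        if remaining <= chunk:        # bgtu t0,t6,.L3 not taken
            break
        remaining -= chunk            # addi a4,a4,-20
    return schedule
```

In this model, vl_schedule(1) gives [[4, 1, 0, 0, 0]] and vl_schedule(4) gives [[4, 4, 4, 4, 4]], so whenever nx is not a multiple of 4 the last iteration runs at least one load/vfmacc pair with VL=0.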