https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710
Bug ID: 69710
Summary: performance issue with SP Linpack with
Autovectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---
Created attachment 37614
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37614&action=edit
extracted daxpy example
We've noticed a performance problem in single precision
Linpack with the MSA patch applied:
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00177.html
which I have been able to reproduce with ARM Neon.
The problem is that autovectorization generates extra induction
variables for the memory references in daxpy (this is an issue on all
architectures). That is, when the statement:
dy[i] = dy[i] + da*dx[i];
is vectorized, the vector load of dy[i] uses a different Induction
Variable (IV) than the subsequent vector store to dy[i]. For example,
for ARM Neon after the vect pass we see:
<bb 12>:
# i_26 = PHI <i_44(11), i_19(20)>
# vectp_dy.12_83 = PHI <vectp_dy.13_81(11), vectp_dy.12_84(20)>
# vectp_dx.15_88 = PHI <vectp_dx.16_86(11), vectp_dx.15_89(20)>
# vectp_dy.20_96 = PHI <vectp_dy.21_94(11), vectp_dy.20_97(20)>
# ivtmp_99 = PHI <0(11), ivtmp_100(20)>
i.0_7 = (unsigned int) i_26;
_8 = i.0_7 * 4;
_10 = dy_9(D) + _8;
vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
_12 = *_10;
_14 = dx_13(D) + _8;
vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
_15 = *_14;
vect__16.18_92 = vect_cst__91 * vect__15.17_90;
_16 = da_6(D) * _15;
vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
_17 = _12 + _16;
MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
i_19 = i_26 + 1;
vectp_dy.12_84 = vectp_dy.12_83 + 16;
vectp_dx.15_89 = vectp_dx.15_88 + 16;
vectp_dy.20_97 = vectp_dy.20_96 + 16;
ivtmp_100 = ivtmp_99 + 1;
if (ivtmp_100 < bnd.9_53)
goto <bb 20>;
else
goto <bb 15>;
...
<bb 20>:
goto <bb 12>;
Note that using separate IVs for the load and store off of dy
can introduce a false memory dependency, which causes poor scheduling
after unrolling. From what I have seen so far, for double precision
the ivopts pass is able to clean up the induction variables so that the
false memory dependency is removed. However, the cleanup does not
happen for single precision.
Attached is a simple example for single precision; more to follow.