https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I've re-done measurements with -O2 -march=x86-64-v3 -flto with PGO on Zen4,
comparing GCC 15.2 against trunk. The regression is still 8%:
Overhead  Samples  Command       Shared Object  Symbol
 14.02%   127488   milc_peak.16  milc_peak.16   [.] u_shift_fermion
 11.89%   106808   milc_peak.15  milc_peak.15   [.] u_shift_fermion
 10.86%   107952   milc_peak.16  milc_peak.16   [.] mult_su3_na
  8.12%    73884   milc_peak.16  milc_peak.16   [.] add_force_to_mom
  8.06%    72382   milc_peak.15  milc_peak.15   [.] add_force_to_mom
  7.82%    82016   milc_peak.15  milc_peak.15   [.] mult_su3_na
  6.43%    60637   milc_peak.16  milc_peak.16   [.] mult_su3_nn
  5.46%    53641   milc_peak.15  milc_peak.15   [.] mult_su3_nn
  2.87%    25889   milc_peak.15  milc_peak.15   [.] compute_gen_staple
  2.57%    23385   milc_peak.15  milc_peak.15   [.] mult_su3_an
  2.40%    21858   milc_peak.16  milc_peak.16   [.] mult_su3_an
  2.35%    25496   milc_peak.16  milc_peak.16   [.] path_product
  2.29%    26044   milc_peak.15  milc_peak.15   [.] path_product
  1.79%    16213   milc_peak.16  milc_peak.16   [.] compute_gen_staple
Likewise, in u_shift_fermion there's an inlined mult_su3_mat_vec which we now
vectorize with an AVX2 + SSE combo rather than with SSE + unrolling. Enabling
cost-model comparison shows:
t.c:12:14: note: Cost model analysis:
Vector inside of loop cost: 444
Vector prologue cost: 120
Vector epilogue cost: 528
Scalar iteration cost: 528
Scalar outside cost: 8
Vector outside cost: 648
prologue iterations: 0
epilogue iterations: 1
Calculated minimum iters for profitability: 2
t.c:12:14: note: Cost model analysis:
Vector inside of loop cost: 268
Vector prologue cost: 116
Vector epilogue cost: 0
Scalar iteration cost: 528
Scalar outside cost: 8
Vector outside cost: 116
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
so the AVX2 version is considered better.
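As a back-of-the-envelope check (a sketch only, not the exact formula used by
vect_estimate_min_profitable_iters, and assuming the first variant runs one
vector iteration plus the one scalar epilogue iteration already folded into
its 648 outside cost, while the second variant covers all three scalar
iterations in its vector body), plugging in the numbers above gives:

/* Back-of-the-envelope totals from the two cost dumps above (not the
   vectorizer's exact accounting).  */
#include <stdio.h>

int
main (void)
{
  int niters = 3;                   /* the i loop runs three times */
  int scalar = niters * 528 + 8;    /* 1592 */
  int vec_first = 1 * 444 + 648;    /* 1092, min profitable iters 2 */
  int vec_avx2 = 1 * 268 + 116;     /*  384, min profitable iters 1 */
  printf ("scalar %d  first variant %d  AVX2 variant %d\n",
          scalar, vec_first, vec_avx2);
  return 0;
}

Both candidates beat the scalar loop at a trip count of three and the AVX2
candidate comes out clearly cheaper, consistent with the choice made.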
typedef struct {
double real;
double imag;
} complex;
typedef struct { complex e[3][3]; } su3_matrix;
typedef struct { complex c[3]; } su3_vector;
/* Multiplies the 3x3 complex (su3) matrix *a by the complex 3-vector *b,
   storing the result in *c:  c->c[i] = sum_j a->e[i][j] * b->c[j].  */
void mult_su3_mat_vec( su3_matrix *a, su3_vector *b, su3_vector *c ){
  int i;
  register double t,ar,ai,br,bi,cr,ci;
  for(i=0;i<3;i++){
    ar=a->e[i][0].real; ai=a->e[i][0].imag;
    br=b->c[0].real; bi=b->c[0].imag;
    cr=ar*br; t=ai*bi; cr -= t;
    ci=ar*bi; t=ai*br; ci += t;
    ar=a->e[i][1].real; ai=a->e[i][1].imag;
    br=b->c[1].real; bi=b->c[1].imag;
    t=ar*br; cr += t; t=ai*bi; cr -= t;
    t=ar*bi; ci += t; t=ai*br; ci += t;
    ar=a->e[i][2].real; ai=a->e[i][2].imag;
    br=b->c[2].real; bi=b->c[2].imag;
    t=ar*br; cr += t; t=ai*bi; cr -= t;
    t=ar*bi; ci += t; t=ai*br; ci += t;
    c->c[i].real=cr;
    c->c[i].imag=ci;
  }
}
Code-gen-wise we emit (this is how the SLP output looks):
vect_ar_5.8_151 = MEM <vector(4) double> [(double *)vectp_a.6_149];
vectp_a.6_152 = vectp_a.6_149 + 32;
vect_ar_5.9_153 = MEM <vector(4) double> [(double *)vectp_a.6_152];
vectp_a.6_154 = vectp_a.6_149 + 64;
vect_ar_5.11_156 = VEC_PERM_EXPR <vect_ar_5.8_151, vect_ar_5.9_153, { 0, 0, 6, 6 }>;
ar_5 = a_4(D)->e[i_46][0].real;
vect_ai_6.14_162 = MEM <vector(4) double> [(double *)vectp_a.12_160];
vectp_a.12_163 = vectp_a.12_160 + 32;
vect_ai_6.15_164 = MEM <vector(4) double> [(double *)vectp_a.12_163];
vectp_a.12_165 = vectp_a.12_160 + 64;
vect_ar_5.17_167 = VEC_PERM_EXPR <vect_ai_6.14_162, vect_ai_6.15_164, { 1, 1, 7, 7 }>;
That's similar to the SSE version, but it shows the usual case of SLP
degenerating to uniform vectors with duplicated lanes and the lack of SLP
load-node CSE, because we keep load permutations rather than lowering them
(this isn't a usual interleaving scheme).
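For reference, VEC_PERM_EXPR selects lanes from the concatenation of its two
operands, so { 0, 0, 6, 6 } takes lane 0 of the first 32-byte load twice and
lane 2 of the second load twice, and { 1, 1, 7, 7 } does the same for the
corresponding .imag lanes: each result is two scalars duplicated into lane
pairs, which is the duplicated-lane pattern mentioned above. A GNU C
generic-vector sketch (not what we actually emit) of the two permutes:

/* Sketch of the two permutes above using GNU C generic vectors.  The
   selector indexes the concatenation of the two inputs, so two distinct
   scalars each end up duplicated into a pair of lanes.  */
typedef double v4df __attribute__ ((vector_size (32)));
typedef long long v4di __attribute__ ((vector_size (32)));

static v4df
perm_0_0_6_6 (v4df a, v4df b)
{
  return __builtin_shuffle (a, b, (v4di) { 0, 0, 6, 6 });
}

static v4df
perm_1_1_7_7 (v4df a, v4df b)
{
  return __builtin_shuffle (a, b, (v4di) { 1, 1, 7, 7 });
}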