https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I've re-done measurements with -O2 -march=x86-64-v3 -flto with PGO on Zen4,
comparing GCC 15.2 against trunk.  It's still 8%.

Overhead       Samples  Command       Shared Object         Symbol
  14.02%        127488  milc_peak.16  milc_peak.16          [.] u_shift_fermion
  11.89%        106808  milc_peak.15  milc_peak.15          [.] u_shift_fermion
  10.86%        107952  milc_peak.16  milc_peak.16          [.] mult_su3_na
   8.12%         73884  milc_peak.16  milc_peak.16          [.] add_force_to_mom
   8.06%         72382  milc_peak.15  milc_peak.15          [.] add_force_to_mom
   7.82%         82016  milc_peak.15  milc_peak.15          [.] mult_su3_na
   6.43%         60637  milc_peak.16  milc_peak.16          [.] mult_su3_nn
   5.46%         53641  milc_peak.15  milc_peak.15          [.] mult_su3_nn
   2.87%         25889  milc_peak.15  milc_peak.15          [.] compute_gen_staple
   2.57%         23385  milc_peak.15  milc_peak.15          [.] mult_su3_an
   2.40%         21858  milc_peak.16  milc_peak.16          [.] mult_su3_an
   2.35%         25496  milc_peak.16  milc_peak.16          [.] path_product
   2.29%         26044  milc_peak.15  milc_peak.15          [.] path_product
   1.79%         16213  milc_peak.16  milc_peak.16          [.] compute_gen_staple

Like in u_shift_fermion, there's an inlined mult_su3_mat_vec which we now
similarly vectorize with an AVX2 + SSE combo rather than SSE + unrolling.

Enabling cost model comparison shows:

t.c:12:14: note:  Cost model analysis:
  Vector inside of loop cost: 444
  Vector prologue cost: 120
  Vector epilogue cost: 528
  Scalar iteration cost: 528
  Scalar outside cost: 8
  Vector outside cost: 648
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2

t.c:12:14: note:  Cost model analysis:
  Vector inside of loop cost: 268
  Vector prologue cost: 116
  Vector epilogue cost: 0
  Scalar iteration cost: 528
  Scalar outside cost: 8
  Vector outside cost: 116
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1

so the AVX2 version is considered better.

typedef struct {
   double real;
   double imag;
} complex;

typedef struct { complex e[3][3]; } su3_matrix;
typedef struct { complex c[3]; } su3_vector;

void mult_su3_mat_vec( su3_matrix *a, su3_vector *b, su3_vector *c ){
int i;
register double t,ar,ai,br,bi,cr,ci;
    for(i=0;i<3;i++){

        ar=a->e[i][0].real; ai=a->e[i][0].imag;
        br=b->c[0].real; bi=b->c[0].imag;
        cr=ar*br; t=ai*bi; cr -= t;
        ci=ar*bi; t=ai*br; ci += t;

        ar=a->e[i][1].real; ai=a->e[i][1].imag;
        br=b->c[1].real; bi=b->c[1].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;

        ar=a->e[i][2].real; ai=a->e[i][2].imag;
        br=b->c[2].real; bi=b->c[2].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;

        c->c[i].real=cr;
        c->c[i].imag=ci;
    }
}

Code-gen wise we emit (this is how the SLP output looks):

  vect_ar_5.8_151 = MEM <vector(4) double> [(double *)vectp_a.6_149];
  vectp_a.6_152 = vectp_a.6_149 + 32;
  vect_ar_5.9_153 = MEM <vector(4) double> [(double *)vectp_a.6_152];
  vectp_a.6_154 = vectp_a.6_149 + 64;
  vect_ar_5.11_156 = VEC_PERM_EXPR <vect_ar_5.8_151, vect_ar_5.9_153, { 0, 0, 6, 6 }>;
  ar_5 = a_4(D)->e[i_46][0].real;
  vect_ai_6.14_162 = MEM <vector(4) double> [(double *)vectp_a.12_160];
  vectp_a.12_163 = vectp_a.12_160 + 32;
  vect_ai_6.15_164 = MEM <vector(4) double> [(double *)vectp_a.12_163];
  vectp_a.12_165 = vectp_a.12_160 + 64;
  vect_ar_5.17_167 = VEC_PERM_EXPR <vect_ai_6.14_162, vect_ai_6.15_164, { 1, 1, 7, 7 }>;

That's similar to the SSE version, but it shows the usual case of SLP
degenerating into uniform vectors with duplicated lanes, plus the lack of
SLP load-node CSE because we keep load permutations rather than lowering
them (this isn't a usual interleaving scheme).
